
I am trying to determine what PowerShell command would be equivalent to the following Linux command for creating a large file in a reasonable time, with an exact size, AND populated with the given text input.

Given:

$ cat line.txt
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ

$ time yes `cat line.txt` | head -c 10GB > file.txt  # create large file
real    0m59.741s

$ ls -lt file.txt
-rw-r--r--+ 1 k None 10000000000 Feb  2 16:28 file.txt

$ head -3 file.txt
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ZZZZ

What would be the most efficient, compact PowerShell command that would allow me to specify the size and text content and create the file like the Linux command above? Thanks! My original question here was automatically closed for some reason.

mojoa
  • This has been talked about in a few places online: [powershell 'create a large file with specific data'](https://www.bing.com/search?q=powershell%20%27create%20a%20large%20file%20with%20specific%20data%27&qs=n&form=QBRE&sp=-1&pq=powershell%20%27create%20a%20large%20file%20with%20specific%20data%27&sc=0-51&sk=&cvid=B5F619CC304546299B260F2A52EE39C2) – postanote Feb 03 '21 at 01:50
  • It wasn't automatically closed; it was intentionally closed as a duplicate. It says as much on that post. It also suggested you edit it into an appropriate, on-topic, non-duplicate form to reopen it. – Doug Maurer Feb 03 '21 at 02:02
  • My apologies. I edited the original and it did not look like it was reopening. I thought the only recourse was to ask another question. I apparently did not make the question clear enough to include the test file input as a requirement and the answers did not reflect it, so I added it. Nothing I have read has shown this question to be a duplicate, so I must be missing something. It seems like a simple enough question; if I could just get a link which meets the requirements I would sure appreciate it. – mojoa Feb 03 '21 at 02:23
  • @DougMaurer, yes, the earlier question was closed _based on user votes_, but it was closed _inappropriately_, because it contained a specific requirement not addressed by the alleged duplicate. Given that the only answer to the original question was one that also ignored the specific requirement and just answered the alleged duplicate question, it makes sense to create a new question that makes the specific requirement more explicit, as has happened here. – mklement0 Feb 03 '21 at 02:33
  • @mklement0 and zett42 Since you collaborated, and considering my esteemed reputation and experience, I need some guidance on how to properly close this out. You have helped me and probably others with your super good work! – mojoa Feb 05 '21 at 03:50
  • I appreciate your asking, mojoa. You should choose based on what you think best serves future readers; here's some guidance: Is the fundamental approach, accompanied by background information more helpful, or is the performance-optimized variation more important? /cc @zett42 – mklement0 Feb 05 '21 at 14:21
  • @mklement0 Thanks for your extracurricular help! As the apparent originator of the code, I selected your answer. Actually, both answers were perfect and much appreciated by all, I am sure. Too bad I could only choose one answer. Thank you both! `cc @zett42` – mojoa Feb 05 '21 at 19:25
  • Thank you for your thoughtful actions and comments. Yes, sometimes the choice is tough, but if a different answer turns out to be more helpful on balance, over time, the up-votes will eventually reflect that (and the two candidate answers now point to each other, which should help too). /cc @zett42 – mklement0 Feb 05 '21 at 20:08

3 Answers


There is no direct PowerShell equivalent of your command.

In fact, with files of this size your best bet is to avoid PowerShell's own cmdlets and pipeline and to make direct use of .NET types instead:

& {
  param($outFile, $size, $content)

  # Add a newline to the input string, if needed.
  $line = $content + "`n"

  # Calculate how often the line must be repeated (including trailing newline)
  # to reach the target size.
  [long] $remainder = 0
  $iterations = [math]::DivRem($size, $line.Length, [ref] $remainder)

  # Create the output file.
  $outFileInfo = New-Item -Force $outFile
  $fs = [System.IO.StreamWriter] $outFileInfo.FullName

  # Fill it with duplicates of the line.
  foreach ($i in 1..$iterations) {
    $fs.Write($line)
  }

  # If a partial line is needed to reach the exact target size, write it now.
  if ($remainder) {
    $fs.Write($line.Substring(0, $remainder))
  }

  $fs.Close()
  
} file.txt 1e10 (Get-Content line.txt)

Note: 1e10 uses PowerShell's support for scientific notation as shorthand for 10000000000 (10,000,000,000, i.e., [Math]::Pow(10, 10)). Note that PowerShell also has built-in support for byte-multiplier suffixes - kb, mb, gb and tb - but they are binary multipliers, so 10gb is equivalent to 10,737,418,240 (10 * [math]::Pow(1024, 3)), not decimal 10,000,000,000.
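
A quick interactive check of the two notations:

1e10          # -> 10000000000 (scientific notation; parsed as a [double])
10gb          # -> 10737418240 (binary multiplier: 10 * [math]::Pow(1024, 3))
[long] 1e10   # -> 10000000000 (cast to [long] when an exact integer byte count is needed)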

Note:

  • The size passed (1e10 in this case) is a character count, not a byte count. Given that .NET's file I/O APIs use BOM-less UTF-8 encoding by default, the two counts will only be equal if you restrict the input string used to fill the file to characters in the ASCII range (code points 0x0 - 0x7f) - see the illustration after this list.

  • The last instance of the input string may be cut off (without a trailing newline) if the total character count isn't an exact multiple of the input string length + 1 (for the newline).

  • Optimizing the performance of this code by up to 20% is possible, through a combination of writing bytes and output buffering, as shown in zett42's helpful answer.
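
To illustrate the character-count vs. byte-count caveat from the first bullet (the euro sign is merely an arbitrary non-ASCII example):

$s = 'abc' + [char] 0x20AC                # 'abc€'; U+20AC is outside the ASCII range
$s.Length                                 # -> 4 (characters)
[Text.Encoding]::UTF8.GetByteCount($s)    # -> 6 (bytes: '€' encodes as 3 UTF-8 bytes)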

The above performs reasonably well by PowerShell standards.

In general, PowerShell's object-oriented nature will never match the speed of the raw byte handling provided by native Unix utilities / shells.

It wouldn't be hard to turn the code above into a reusable function; in
a nutshell, replace & { ... } with something like function New-FileOfSize { ... } and call New-FileOfSize file.txt 1gb (Get-Content line.txt) - see the conceptual about_Functions help topic, and about_Functions_Advanced for how to make the function more sophisticated.
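
A minimal sketch of that transformation (New-FileOfSize and its parameter names are merely illustrative; the body is simply the script block from above):

function New-FileOfSize {
  param($outFile, $size, $content)
  # ... same body as the & { ... } script block above ...
}
New-FileOfSize file.txt 1gb (Get-Content line.txt)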

mklement0
  • For possible performance improvements, one could experiment with: 1) specify a larger buffer size (the default is only 4 KiB AFAIK) and 2) encode the string only once and use `FileStream` directly. – zett42 Feb 03 '21 at 18:03
  • I did [some measurements](https://stackoverflow.com/a/66037174/7571258). There is a notable, but not spectacular performance improvement. – zett42 Feb 03 '21 at 23:21
  • Is there an advantage in calling `New-Item` instead of letting the .NET API create the file? The former produces cleaner error messages, but I don't like the potential race condition, e. g. the file could be deleted (or opened by another process, blocking write access) between the call to `New-Item` and the .NET API that opens it. – zett42 Feb 04 '21 at 23:04
  • @zett42, the only reason I'm using it is so I can reliably get a _full path_, which is needed for calling .NET methods, given that .NET's current dir. usually differs from PowerShell's. Ideally, `Convert-Path` could be used, but - regrettably - it only works with _existing_ files or folders - see [GitHub issue #2993](https://github.com/PowerShell/PowerShell/issues/2993). In .NET Core only (therefore not in Windows PowerShell), you could use `[System.IO.Path]::GetFullPath($outFile, $PWD.ProviderPath)`. – mklement0 Feb 04 '21 at 23:16
  • Is `$ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath()` a viable alternative? – zett42 Feb 04 '21 at 23:19
  • Good find, @zett42 - that's definitely _viable_, but also highly _obscure_. That said, wrapped inside a function you may choose to use it. However, note that there is _no_ race condition here (at least not created by the function itself): `New-Item` is synchronous, and the `[System.IO.StreamWriter]` constructor doesn't care whether the file already exists or not - it either creates the file or truncates it. – mklement0 Feb 04 '21 at 23:40
  • I didn't mean a race condition between the PowerShell and .NET API calls. But there is a (albeit small) time window between the call to `New-Item` and the `[System.IO.StreamWriter]` constructor where another thread or process could mess with the file. – zett42 Feb 04 '21 at 23:53
  • @zett42, yes, but I don't think that matters here: the function makes no guarantees as to when it tries to claim the file, and the sole reason for using `New-Item` is to determine the full path. Sure, by trying to claim the file _twice_ - first by `New-Item`, and then again by the `[System.IO.StreamWriter]` constructor - there is a hypothetical additional potential point of failure, but I don't think concurrency concerns are in the picture here. I agree that _technically_ `$ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath()` is the better approach, but it's so obscure... – mklement0 Feb 05 '21 at 00:40

A slightly optimized version of mklement0's script.

  • Encode the string only once at the beginning.
  • Use System.IO.FileStream instead of System.IO.StreamWriter to write raw bytes instead of a string which has to be encoded first.
  • Use a larger buffer than StreamWriter's default, which is rather small. A size of 1 MiB seems to be the sweet spot on my machine. A 2 MiB buffer is already slower, probably due to worse caching behaviour. It may vary on your machine.
  • Unrelated to performance: a line feed character is no longer added to the input string $content. If needed, it can be added to the argument by the user. To make this possible, I have added the -raw argument to the Get-Content call.
& {
    param($outFile, $size, $content)
  
    # Encode the input string as UTF-8
    $encoding = [Text.UTF8Encoding]::new()
    $contentBytes = $encoding.GetBytes( $content )
  
    # Calculate how often the content must be repeated to reach the target size.
    [long] $remainder = 0
    $iterations = [math]::DivRem($size, $contentBytes.Length, [ref] $remainder)
  
    # Convert the PowerShell path to a full path for use by .NET API.
    # .NET can't use a relative PowerShell path as its current directory may differ from
    # PowerShell's current directory.
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath( $outFile )

    # Create a file stream with a large buffer size for improved performance.
    $bufferSize = 1MB
    $stream = [IO.FileStream]::new( $fullPath, [IO.FileMode]::Create, [IO.FileAccess]::Write, 
                                    [IO.FileShare]::Read, $bufferSize )

    try {
        # Fill it with duplicates of the content.
        foreach ($i in 1..$iterations) {
            $stream.Write($contentBytes, 0, $contentBytes.Length)
        }
      
        # If a substring of the content is needed to reach the exact target size, write it now.
        # Note this may cut a multi-byte UTF-8 sequence short at the end, depending on
        # the input. Basic ASCII is no problem.
        if ($remainder) {
            $stream.Write($contentBytes, 0, $remainder)
        } 
    }
    finally {
        # Close the stream even when an exception has been thrown.
        $stream.Close()
    }    
} file.txt 1gb (Get-Content -raw line.txt) 
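
Note: as discussed in the comments below, to reproduce the question's exact decimal size of 10,000,000,000 bytes, replace 1gb with 1e10 (or 10000000000) on the invocation line:

} file.txt 1e10 (Get-Content -raw line.txt)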

For testing, the script was used to create a 1 GiB file with the OP's test content (99 characters + LF). For each test, the average MiB/s over 100 runs was calculated:

$duration = (1..100 | %{ (Measure-Command { .\Test.ps1 }).TotalSeconds } | Measure-Object -Average).Average
"$(1024 / $duration) MiB/s"

Test results:

| Script             | Buffer size | MiB/s |
|--------------------|-------------|-------|
| mklement0's script | default     | 438   |
| optimized script   | 4 KiB       | 434   |
| optimized script   | 16 KiB      | 483   |
| optimized script   | 64 KiB      | 521   |
| optimized script   | 256 KiB     | 524   |
| optimized script   | 1 MiB       | 528   |
| optimized script   | 2 MiB       | 526   |

So in the best case we have a ~20% increase in performance. Not spectacular, but still noticeable.

The values look quite good when compared with the SSD performance measured by winsat:

> winsat disk -seq -write -drive x
Disk  Sequential 64.0 Write                  496.03 MB/s
zett42
  • This is great work and will allow me to create a large file on a corporate workstation for testing. I was able to create a > 10GB file in less than 60 seconds. I am wondering about the file size though. I created line.txt as a 100 byte input file, and was able to create an exact-size 10GB file as shown in the question. The last line in the solution's file.txt appears to be incomplete and the total file size is 10737418240 for file.txt. Is there some reason behind that? You mention 96 characters; it should be 100 (ASCII 32-127 for line.txt). Sorry for any confusion I may have caused. – mojoa Feb 04 '21 at 02:01
  • @mojoa The 96 characters was a mistake on my side. I've also removed the added line feed from the code and use `Get-Content -raw` to include the LF from the file instead. The last incomplete line is correct though, because 10 GB is not evenly divisible by 100. Type this in the PowerShell console: `10GB / 100`. – zett42 Feb 04 '21 at 09:12
  • Not sure about how PowerShell calculates bytes. As shown in the Question, I was able to create exactly 10000000000 byte file.txt by appending 100 byte line.txt. The Linux wc command and the Windows DIR also verifies this. – mojoa Feb 04 '21 at 15:51
  • @mojoa See what mklement0 wrote in the note under his code. Replace "10GB" by "1e10". – zett42 Feb 04 '21 at 15:58
  • So, would a complete PowerShell solution for an exact byte count need to include a function which counts lines and truncates the excess, like Linux head does? Maybe as a post-processing step? That would increase the actual file creation time, I presume. – mojoa Feb 04 '21 at 20:17
  • @mojoa I'm not sure what you mean. If in the last line of my script `} file.txt 1gb (Get-Content -raw line.txt)`, you replace `1gb` by `1e10` or `10000000000`, you will get a file with that exact size, where the last line is not truncated. – zett42 Feb 04 '21 at 20:45
  • Perfect, Thanks. – mojoa Feb 05 '21 at 04:06

Continuing from my comment.

There is no command to do this. You have to code it.

Just going from the info I pointed to via the search: in PowerShell proper, a quick take on your use case would look like this approach.

Function New-EmptyFile
{
<#
.Synopsis
    Create a new empty file 
.DESCRIPTION
    This function creates a new file of the given size
.EXAMPLE
    New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb

.EXAMPLE
    nef 'D:\Temp\nef.txt' 10mb

.NOTES
    You can modify data in the file this way
    (Get-Content -path 'D:\Temp\nef.txt' -Raw) -replace '\.*','white' | 
    Set-Content -Path 'D:\Temp\nef.txt'    
#>

    [cmdletbinding(SupportsShouldProcess)]
    [Alias('nef')]
    param
    (
        [string]$FilePath,
        [double]$Size
    )
 
    # Create the file and pre-extend it to the requested size; its content is all NUL bytes.
    $file = [System.IO.File]::Create($FilePath)
    $file.SetLength($Size)
    $file.Close()

    Get-Item $file.Name
}

You could take this:

(Get-Content -path 'D:\Temp\nef.txt' -Raw) -replace '\.*','white' | 
Set-Content -Path 'D:\Temp\nef.txt'

... and make it part of the function. Something like this:

Function New-EmptyFile
{
<#
.Synopsis
    Create a new empty file 
.DESCRIPTION
    This function creates a new file of the given size
.EXAMPLE
    New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb

.EXAMPLE
    nef 'D:\Temp\nef.txt' 10mb

.NOTES
    Other notes here
 
#>

    [cmdletbinding(SupportsShouldProcess)]
    [Alias('nef')]
    param
    (
        [string]$FilePath,
        [double]$Size,
        [string]$FileData
    )
 
    $file = [System.IO.File]::Create($FilePath)
    $file.SetLength($Size)
    $file.Close()

    Get-Item $file.Name

    If ($FileData)
    {
        # Note: the pattern '\.*' matches literal dots (and the empty string), so this
        # effectively inserts $FileData at every character position of the NUL-filled file;
        # adjust the pattern as needed.
        (Get-Content -Path (Get-Item $file.Name).FullName -Raw) -replace '\.*',$FileData |
        Set-Content -Path (Get-Item $file.Name).FullName
    }
}

New-EmptyFile -FilePath 'D:\Temp\nef.txt' -Size 10mb -FileData 'The quick brown fox.'

However, when dealing with large files, getting good performance specifically means using the .NET classes directly.

None of the above is an exact replacement of what you posted, so, you will need to tweak as needed.

See this write-up: Reading large text files with Powershell

postanote