
I am attempting to solve the following problem:

Given a number of similarly formatted text files (~800 MB worth of them, in my case), retrieve all lines from them and remove the duplicates.

I attempted to solve this problem by running this command:

cat *.txt | Sort-Object -unique >output.txt

Then, PowerShell quickly consumed all my available RAM (over 16 GB) and ran for over 20 minutes without writing anything to the output file. I then ran `cat *.txt >output.log` to rule out the possibility of the shell reading the file it was writing to, but that command still maxed out all my RAM and produced almost no output.

Why did this happen? How can concatenating 800 MB of files on disk consume all of my RAM?

How can I solve this problem more efficiently with PowerShell?

Value of $PSVersionTable, if that helps:

Name                           Value
----                           -----
PSVersion                      5.1.19041.1682
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.19041.1682
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

Thanks in advance.

  • `Get-Content` is known to be very inefficient for large amounts of data. Try using [filestreams](https://stackoverflow.com/a/47352340/9529842) instead. – boxdog Sep 01 '22 at 08:39
  • Ah, so cat in PS is actually Get-Content, thanks, will try with filestreams. – Preposterone Sep 01 '22 at 09:01
  • Try `[IO.File]::ReadAllLines(path) | Sort-Object -Unique`. That should make it faster and possibly reduce memory usage as well, as `Get-Content` adds custom data to each line, such as the file path. `ReadAllLines` only outputs the raw strings. In any case, you have to collect all unique lines in memory to be able to sort them. – zett42 Sep 01 '22 at 09:46
  • @zett42, thank you for a practical solution. I have attempted to use it and still ended up with powershell maxing out all my RAM and still not producing any output. I still don't understand, why does powershell need all my RAM to sort a 900 mb file? If I write a python script that does the same, it will not require this many resources. – Preposterone Sep 01 '22 at 09:59
  • Is casing important? Eg. do you consider the lines "Hello" and "hello" the same or different? – Mathias R. Jessen Sep 01 '22 at 10:07
  • @MathiasR.Jessen, casing is important, yes. – Preposterone Sep 01 '22 at 10:08
  • See [`#11221` Select-Object -Unique is unnecessary slow and exhaustive](https://github.com/PowerShell/PowerShell/issues/11221) and [How to sort 30Million csv records in Powershell](https://stackoverflow.com/q/66057891/1701026) – iRon Sep 01 '22 at 11:52

1 Answer


Why did this happen? How can concatenating 800 MB of files on disk consume all of my RAM?

There are a number of reasons the original number of bytes on disk might explode when read into runtime memory: strings read from an ASCII-encoded file automatically take up twice the amount of memory, because .NET internally uses 2 bytes (UTF-16) to represent each character. Additionally, you need to account for the number of individual strings you instantiate; each one needs space for at least one 8-byte reference as well, plus the usual per-object overhead.
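
To put rough numbers on that (a back-of-envelope sketch; the average line length and per-string overhead below are my own illustrative assumptions, not figures from the question):

# Assumptions (mine): 64-bit .NET, ~40 ASCII characters per line,
# ~26 bytes of object overhead per string. Get-Content additionally
# attaches note properties (PSPath etc.) to each line, adding even more.
$fileBytes    = 800MB
$avgLineChars = 40
$lineCount    = [math]::Floor($fileBytes / ($avgLineChars + 2))   # +2 for CR/LF
$charBytes    = $lineCount * $avgLineChars * 2                    # UTF-16: 2 bytes per char
$objectBytes  = $lineCount * 26                                   # per-string object overhead
$refBytes     = $lineCount * 8                                    # one 8-byte reference each
$totalMB      = ($charBytes + $objectBytes + $refBytes) / 1MB
'{0:N0} strings, roughly {1:N0} MB before Sort-Object even starts buffering' -f $lineCount, $totalMB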

The major problem with your current approach, however, is that every single line read from the files must stay resident in memory until PowerShell can perform the sort operation and discard the duplicates - that's just how Sort-Object works: it cannot emit any output until it has collected its entire input, which is also why your output file stayed empty.

How can I solve this problem more efficiently with PowerShell?

To avoid this, use a data type optimized for only storing unique values: the [HashSet[T]] class!

# Collect lines into a set that only keeps unique values
$uniqueStrings = [System.Collections.Generic.HashSet[string]]::new()
cat *.txt | ForEach-Object {
    # Add() returns $true/$false; cast to [void] to suppress that output
    [void]$uniqueStrings.Add($_)
}

$uniqueStrings | Set-Content output.txt

When you call Add(), the hashset checks whether it already contains an equal string. If it does, it simply discards the new value, meaning your script no longer holds references to the duplicate strings, and the runtime can clean them up.
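
As a quick illustration of that behaviour (the literal values here are just examples):

$demo = [System.Collections.Generic.HashSet[string]]::new()
$demo.Add('Hello')   # True  - value was new and got stored
$demo.Add('Hello')   # False - exact duplicate, discarded
$demo.Add('hello')   # True  - the default string comparer is case-sensitive,
                     #         which matches the requirement from the comments
$demo.Count          # 2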

If having the output sorted is also important, pipe the values through Sort-Object as the last step before writing to disk:

$uniqueStrings |Sort-Object |Set-Content output.txt
Mathias R. Jessen
  • `HashSet` isn't sorted though, but it's not clear from the question if that is important. If so, a [`SortedSet`](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.sortedset-1?view=net-6.0) could be used. – zett42 Sep 01 '22 at 10:22
  • @zett42 Indeed. It sounds more like "deduplicate input and write distinct values back to file" is the concern here, I'd personally suggest just doing `$uniqueStrings |Sort-Object |...` at the end if sorting is _also_ required – Mathias R. Jessen Sep 01 '22 at 10:24
  • Thanks for your answer, it is true, I do not care about the order. `Get-Content` is still ridiculously slow, so I skipped concatenation and simply joined the files using cmd's `copy /a *.txt newfile.txt`. Then I ran your snippet, although a bit adapted: `$uniqueStrings = [System.Collections.Generic.HashSet[string]]::new(); [IO.File]::ReadLines('filepath', [Text.Encoding]::ASCII) | ForEach-Object { [void]$uniqueStrings.Add($_) }`. This seemed adequate performance-wise, however `$uniqueStrings |Set-Content output.txt` again ate through all my RAM and produced no output after 25 minutes. – Preposterone Sep 01 '22 at 11:08
  • 1
    @Preposterone `[IO.File]::WriteAllLines("$PWD\output.txt", $uniqueStrings)`. And it's good that you use `ReadLines()` instead of my suggested `ReadAllLines()`, because the former uses much less memory. – zett42 Sep 01 '22 at 11:14
  • @Preposterone As another performance improvement, avoid `ForEach-Object` and use `foreach($line in [IO.File]::ReadLines('filepath', [Text.Encoding]::ASCII))` instead. Due to how script blocks are called by `ForEach-Object`, it is [notoriously slow](https://github.com/PowerShell/PowerShell/issues/10982). Alternatively pipe to a script block like this: `[IO.File]::ReadLines('filepath', [Text.Encoding]::ASCII) | & { process { <# do stuff #> }}` – zett42 Sep 01 '22 at 11:19 (these suggestions are pulled together in the sketch after this comment thread)
  • @zett42, thank you, your additions have helped immensely with speeding up this process. It's crazy, I think: such a trivial task, which would take several minutes at most in bash with coreutils, took almost an hour of raw processing time, ate through all my RAM, and required using C# (?) standard library utilities to get a result. P.S.: the result is 41 million lines, ~430 MB of storage. – Preposterone Sep 01 '22 at 11:32
  • @Preposterone Bash is based on text processing for chaining commands, so it's no wonder that it's fast for this use case. PowerShell, on the other hand, uses an object-oriented approach that is very flexible (and IMO more intuitive to use, most of the time), but that comes with a cost. This mostly matters only when doing bulk data processing of large files, as in this case. PowerShell could be faster, though, and there are already many GitHub issues regarding this, which will hopefully be tackled at some point, so less use of .NET code will be necessary. – zett42 Sep 01 '22 at 13:46
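
Pulling the suggestions from these comments together, a minimal end-to-end sketch might look like the following. The file paths and the ASCII encoding are assumptions carried over from the comments; adjust them to your data:

# Read lines lazily, deduplicate in a HashSet, then write with .NET directly.
# Use absolute paths (or prefix with $PWD) because .NET resolves relative
# paths against the process working directory, not the PowerShell location.
$uniqueStrings = [System.Collections.Generic.HashSet[string]]::new()

foreach ($line in [IO.File]::ReadLines('C:\data\combined.txt', [Text.Encoding]::ASCII)) {
    [void]$uniqueStrings.Add($line)   # duplicates are silently discarded
}

# WriteAllLines avoids the pipeline overhead that made Set-Content struggle above.
[IO.File]::WriteAllLines('C:\data\output.txt', $uniqueStrings)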