
I am using a MacBook M1, and when I use the native `wc -l $file` I get results lightning fast, almost certainly because I can pass the file in directly. However, I don't see how to do this with PowerShell, so I am forced to funnel the data through stdout, which is slow even for `wc` (though still faster than `Measure-Object`).

Is there a faster way in PowerShell to get the number of lines in a file? As fast as `wc -l $file`?

PS /Users/cbongior/dev/oracle/dart-ingestion/omc> Measure-Command {Get-Content ./final_class2.csv | Measure-Object -Line}

Days              : 0
Hours             : 0
Minutes           : 1
Seconds           : 8
Milliseconds      : 684
Ticks             : 686847045
TotalDays         : 0.000794961857638889
TotalHours        : 0.0190790845833333
TotalMinutes      : 1.144745075
TotalSeconds      : 68.6847045
TotalMilliseconds : 68684.7045


PS /Users/cbongior/dev/oracle/dart-ingestion/omc> Measure-Command { wc -l ./final_class2.csv }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 166
Ticks             : 1663259
TotalDays         : 1.92506828703704E-06
TotalHours        : 4.62016388888889E-05
TotalMinutes      : 0.00277209833333333
TotalSeconds      : 0.1663259
TotalMilliseconds : 166.3259


PS /Users/cbongior/dev/oracle/dart-ingestion/omc> Measure-Command { cat ./final_class2.csv | wc -l  }

Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 51
Milliseconds      : 187
Ticks             : 511870216
TotalDays         : 0.00059244237962963
TotalHours        : 0.0142186171111111
TotalMinutes      : 0.853117026666667
TotalSeconds      : 51.1870216
TotalMilliseconds : 51187.0216


Christian Bongiorno
  • Just to state what perhaps is obvious: you can call `wc -l` directly from PowerShell. The `[System.Linq.Enumerable]::Count([System.IO.File]::ReadLines())` solution from Santiago's answer is actually faster in my tests, though. – mklement0 Jun 07 '23 at 18:39

1 Answer


`Get-Content` is known to be slow because it attaches ETS (Extended Type System) properties to each line it emits. A straightforward and efficient alternative is to use `File.ReadLines` with `Enumerable.Count`:

[System.Linq.Enumerable]::Count(
    [System.IO.File]::ReadLines((Convert-Path ./final_class2.csv)))

Or just `File.ReadAllLines`, which reads the whole file into an array first:

[System.IO.File]::ReadAllLines((Convert-Path ./final_class2.csv)).Length
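
For repeated use, the lazy `File.ReadLines` approach can be wrapped in a small function. This is only a sketch: `Get-LineCount` is a hypothetical name, not an existing cmdlet, and the temporary file exists purely for the demo. `Convert-Path` matters because .NET methods resolve relative paths against the process working directory rather than PowerShell's current location.

```powershell
# Hypothetical helper (not part of the answer above): counts lines lazily,
# without loading the whole file into memory.
function Get-LineCount {
    param(
        [Parameter(Mandatory)]
        [string] $Path
    )
    # Resolve to a full path, since .NET ignores PowerShell's current location.
    $fullPath = Convert-Path -LiteralPath $Path
    [System.Linq.Enumerable]::Count([System.IO.File]::ReadLines($fullPath))
}

# Demo on a throwaway file: three lines, the last one without a trailing newline.
$tmp = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllText($tmp, "a`nb`nc")
$count = Get-LineCount -Path $tmp
Remove-Item -LiteralPath $tmp
$count
```

Note that the two tools can disagree by one: `File.ReadLines` yields a final line even when it lacks a terminating newline, whereas `wc -l` counts newline characters, so it would report 2 for the demo file.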
mklement0
Santiago Squarzon
  • Nice. The first solution (the LINQ-based one) isn't just more memory-friendly (albeit more complex), it is also faster (which came as a surprise to me; I guess the array creation is more expensive than the lazy enumeration), and in my tests even beats `wc -l` – mklement0 Jun 07 '23 at 18:38
  • @mklement0 I didn't personally test their performance, but my heart told me the LINQ method would be faster and more efficient. It's sad that extension methods don't translate as nicely in PowerShell as they do in C# – Santiago Squarzon Jun 07 '23 at 18:43
  • Indeed - I suspect that you know of the relevant feature request, but let me link to it for future readers: [GitHub issue #2226](https://github.com/PowerShell/PowerShell/issues/2226). – mklement0 Jun 07 '23 at 18:45
  • On my Mac, `wc` is much faster, but the LINQ approach is acceptably fast. I suspect this has a lot to do with the OS it's running on:
    ```
    PS /Users/cbongior/dev/oracle/dart-ingestion/omc> Measure-Command { [System.Linq.Enumerable]::Count([System.IO.File]::ReadLines((Convert-Path ./all.csv))) } | %{$_.TotalSeconds}
    1.541434
    PS /Users/cbongior/dev/oracle/dart-ingestion/omc> Measure-Command { wc -l ./all.csv } | %{$_.TotalSeconds}
    0.4622112
    ```
    – Christian Bongiorno Jun 07 '23 at 18:54
  • @ChristianBongiorno 1s slower :P FYI, running performance tests in PowerShell is in general not a good indication of how performant a method is, especially when the time difference is so small – Santiago Squarzon Jun 07 '23 at 19:10
  • @ChristianBongiorno, an important factor when comparing file-processing performance is _caching behavior_. You should always "warm up" the cache first to level the playing field (though, conceivably, how caching works via a native vs. a managed application can be an additional factor), and it's always worth averaging _multiple_ runs of PowerShell code. I encourage you to run [these benchmarks](https://gist.github.com/mklement0/e8cabb620342af37ae7d0faecba7d588#file-bm_76419365-ps1). – mklement0 Jun 08 '23 at 01:29
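
The warm-up and averaging advice in the last comment could be sketched roughly as follows. This does not reproduce the linked benchmarks; `Measure-Average` and its parameters are hypothetical names chosen for illustration, and the generated temporary file merely stands in for a real CSV.

```powershell
# Hypothetical benchmark helper; the name and parameters are illustrative only.
function Measure-Average {
    param(
        [Parameter(Mandatory)] [scriptblock] $ScriptBlock,
        [int] $Runs = 5
    )
    # Warm-up run: lets the OS cache the file before timing starts; result discarded.
    $null = & $ScriptBlock
    # Timed runs: collect the elapsed milliseconds of each run.
    $times = foreach ($i in 1..$Runs) {
        (Measure-Command $ScriptBlock).TotalMilliseconds
    }
    ($times | Measure-Object -Average).Average
}

# Demo on a generated file of 100,000 lines.
$tmp = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllLines($tmp, [string[]] (1..100000))
$avgMs = Measure-Average -Runs 3 -ScriptBlock {
    [System.Linq.Enumerable]::Count([System.IO.File]::ReadLines($tmp))
}
Remove-Item -LiteralPath $tmp
"Average over 3 runs: $avgMs ms"
```

To compare approaches, the same helper can be called with `{ wc -l $tmp }` or `{ Get-Content $tmp | Measure-Object -Line }` as the script block, keeping the file and run count identical.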