21

I have 265 CSV files with over 4 million total records (lines), and I need to do a search and replace across all of them. I have a snippet of my PowerShell code below that does this, but it takes 17 minutes to perform the action:

ForEach ($file in Get-ChildItem C:\temp\csv\*.csv) 
{
    $content = Get-Content -path $file
    $content | foreach {$_ -replace $SearchStr, $ReplaceStr} | Set-Content $file
}

Now I have the following Python code that does the same thing but takes less than 1 minute to perform:

import os, fnmatch

def findReplace(directory, find, replace, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)
            with open(filepath) as f:
                s = f.read()
            s = s.replace(find, replace)
            with open(filepath, "w") as f:
                f.write(s)

findReplace("c:/temp/csv", "Search String", "Replace String", "*.csv")

Why is the Python method so much more efficient? Is my PowerShell code inefficient, or is Python just a more powerful programming language when it comes to text manipulation?

Keith

5 Answers

10

Give this PowerShell script a try. It should perform much better, and it uses much less RAM too, since the file is read as a buffered stream.

# Read the file as a buffered stream, one line at a time, instead of loading it all into memory
$reader = [IO.File]::OpenText("C:\input.csv")
$writer = New-Object System.IO.StreamWriter("C:\output.csv")

while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    $line2 = $line -replace $SearchStr, $ReplaceStr
    $writer.WriteLine($line2)
}

$reader.Close()
$writer.Close()

This processes one file, but you can test performance with it and, if the performance is acceptable, add it to a loop, as sketched below.
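
A rough sketch of that loop (illustrative only: the temporary file and the Move-Item swap at the end are my additions, since the snippet above writes to a separate output file):

foreach ($file in Get-ChildItem C:\temp\csv\*.csv) {
    $tempPath = "$($file.FullName).tmp"
    $reader = [IO.File]::OpenText($file.FullName)
    $writer = New-Object System.IO.StreamWriter($tempPath)

    while ($reader.Peek() -ge 0) {
        # Same line-by-line replace as above, applied to the current file
        $writer.WriteLine(($reader.ReadLine() -replace $SearchStr, $ReplaceStr))
    }

    $reader.Close()
    $writer.Close()

    # Swap the rewritten copy in over the original file
    Move-Item -Path $tempPath -Destination $file.FullName -Force
}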

Alternatively, you can use Get-Content to read a number of lines into memory at a time, perform the replacement, and then write the updated chunk back out through the PowerShell pipeline.

Get-Content "C:\input.csv" -ReadCount 512 | % {
    $_ -replace $SearchStr, $ReplaceStr
} | Set-Content "C:\output.csv"

To squeeze out a little more performance, you can also compile the regex (-replace uses regular expressions) like this:

$re = New-Object Regex $SearchStr, 'Compiled'
$re.Replace( $_ , $ReplaceStr )
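
Dropped into the line-by-line loop above, that might look like this (a sketch; $reader and $writer are the stream objects from the first snippet):

$re = New-Object Regex $SearchStr, 'Compiled'

while ($reader.Peek() -ge 0) {
    # Reuse the pre-compiled regex for every line instead of re-parsing the pattern each time
    $writer.WriteLine($re.Replace($reader.ReadLine(), $ReplaceStr))
}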
Andy Arismendi
  • In the Python case, however, it still processes each file in one go (it just takes more lines of code to get there) so I'd imagine that memory usage is "about the same"... or did I miss something? :( –  Mar 15 '12 at 17:14
  • @pst I haven't tested but it looks like `s = f.read()` loads the entire thing into memory. You can also do this with PowerShell using `$reader.ReadToEnd()`. – Andy Arismendi Mar 15 '12 at 17:23
  • Ah, I was under the assumption that was how `Get-Content` operated :-/ –  Mar 15 '12 at 17:26
  • 2
    @pst Get-Content allows for streaming, meaning that the next cmdlet in the pipeline can start working as soon as data enters the pipeline - even if more data is still to come. When you assign to $content, you lose that functionality, because the entire Get-Content must finish before the next operation happens. – Daniel Richnak Mar 15 '12 at 21:41
  • I ran the ReadLine/WriteLine solution and it reduced my time down to 5 min! Huge improvement! I will have to try the -ReadCount and pipe approach later tonight. – Keith Mar 15 '12 at 23:33
8

I see this a lot:

$content | foreach {$_ -replace $SearchStr, $ReplaceStr} 

The -replace operator will handle an entire array at once:

$content -replace $SearchStr, $ReplaceStr

and do it a lot faster than iterating through one element at a time. I suspect doing that may get you closer to an apples-to-apples comparison.
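
Applied to the loop from the question, that would look something like this (a sketch, reusing the question's $SearchStr and $ReplaceStr variables):

foreach ($file in Get-ChildItem C:\temp\csv\*.csv) {
    # -replace runs against the whole array returned by Get-Content in a single call
    (Get-Content -Path $file) -replace $SearchStr, $ReplaceStr | Set-Content -Path $file
}

The parentheses matter: they force the file to be read completely before Set-Content reopens it for writing.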

mjolinor
3

You may want to try the following command:

gci C:\temp\csv\*.csv | % { (gc $_) -replace $SearchStr, $ReplaceStr | out-file $_}

In addition, because -replace treats the search string as a regular expression, a search string containing regex metacharacters needs to be escaped with [regex]::Escape. The code would look like:

gci C:\temp\csv\*.csv | % { (gc $_) -replace [regex]::Escape($SearchStr), $ReplaceStr | out-file $_}
Mark
3

I don't know Python, but it looks like you are doing literal string replacements in the Python script. In PowerShell, the -replace operator is a regular expression search/replace. I would convert the PowerShell to use the Replace method on the string class (or, to answer the original question, I think your PowerShell is inefficient).

ForEach ($file in Get-ChildItem C:\temp\csv\*.csv) 
{
    $content = Get-Content -path $file
    # look close, not much changes
    $content | foreach {$_.Replace($SearchStr, $ReplaceStr)} | Set-Content $file
}

EDIT Upon further review, I think I see another (perhaps more important) difference between the versions. The Python version appears to read the entire file into a single string. The PowerShell version, on the other hand, reads it into an array of strings.

The help on Get-Content mentions a ReadCount parameter that can affect performance. Setting this count to -1 seems to read the entire file into a single array. This means you are passing an array through the pipeline instead of individual strings, but a simple change to the code will deal with that:

# $content is now an array
$content | % { $_ } | % {$_.Replace($SearchStr, $ReplaceStr)} | Set-Content $file

If you want to read the entire file into a single string like the Python version seems to, just call the .NET method directly:

# now you have to make sure to use a FULL RESOLVED PATH
$content = [System.IO.File]::ReadAllText($file.FullName) 
$content.Replace($SearchStr, $ReplaceStr) | Set-Content $file

This is not quite as "PowerShell-y", since you call the .NET APIs directly instead of the corresponding cmdlets, but the ability is there for the times when you need it.
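
Put together, a whole-file version of the original loop might look like this (a sketch; using [System.IO.File]::WriteAllText for the write side is my substitution for Set-Content, to keep the whole round trip in .NET):

foreach ($file in Get-ChildItem C:\temp\csv\*.csv) {
    # Read the whole file as one string, do a literal (non-regex) replace, and write it back
    $text = [System.IO.File]::ReadAllText($file.FullName)
    [System.IO.File]::WriteAllText($file.FullName, $text.Replace($SearchStr, $ReplaceStr))
}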

Gideon Engelberth
  • But regular expressions -- for any simple non-excessive backtracking regular expression -- generally have *very fast* implementations. Maybe not quite as fast as a normal string search (although it can be in some cases) but I doubt it is 17 times slower :( In any case, +1 for the difference and test code. –  Mar 15 '12 at 17:17
  • @pst The regular expression can be compiled in PowerShell which should give better performance. `$re = New-Object regex '\w+', 'Compiled'` – Andy Arismendi Mar 15 '12 at 17:29
  • Thanks for the test code above. I ran it and it still took a very long time to complete the task. – Keith Mar 15 '12 at 23:26
  • @pst *generally* perhaps, but I wouldn't assume so. Also I don't know how good Powershell is at caching the regex, so it may be recreating the regex at every iteration. Andy's idea would avoid that issue. – Gideon Engelberth Mar 16 '12 at 00:01
  • @GideonEngelberth The .NET BCL caches the last 20 regular expressions or so [in most cases]. It would be interesting to see a benchmark shootout of the various approaches mentioned. –  Mar 16 '12 at 02:18
1

Actually, I'm faced with a similar issue right now. With my new job, I have to parse huge text files to pull information based on certain criteria. The PowerShell script (optimized to the brim) takes 4 hours to return a fully processed CSV file. We wrote another Python script that took just under 1 hour...

As much as I love PowerShell, I was heartbroken. For your amusement, try this. PowerShell:

$num = 0
$string = "Mary had a little lamb"

while($num -lt 1000000){
    $string = $string.ToUpper()
    $string = $string.ToLower()
    Write-Host $string
    $num++
}

Python:

num = 0
string = "Mary had a little lamb"

while num < 1000000:
    string = string.lower()
    string = string.upper()
    print(string)
    num+=1

and trigger the two jobs. You can even wrap each one in Measure-Command {} to keep it "scientific".
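
For example (assuming the two snippets are saved as mary.ps1 and mary.py - file names of my own choosing, purely illustrative):

# Time the PowerShell version
Measure-Command { .\mary.ps1 }

# Time the Python version the same way
Measure-Command { python.exe .\mary.py }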

Also, link, crazy read..

Alex
  • 2
    This is nothing to do with the question, and in the link: of course the powershell cmdlets are slower, they get a lot more information about the file (like creation times as .Net DateTime, and Length) and construct PSCustomObjects, and they have code for PSProviders underneath them. The .Net method call has `@() +=` pattern in it which is absolutely cripplingly slow and never recommended for fast PowerShell. The claim that C# is compiled implies PowerShell is not - it is, by the DLR engine in PS 3 and above - but only for saved scripts and functions, not code run from ISE or typed in. – TessellatingHeckler Dec 15 '17 at 09:18
  • 2
    Your "Mary had a little lamb" comparison is also flawed - the `.lower()` and `.upper()` calls in both languages don't change the string, they generate and return a new string. In Python when you generate a value and don't use it, nothing happens. In PowerShell when you generate a value and don't assign it anywhere it goes to the output pipeline - meaning your PowerShell script is printing three times as many strings as your Python, via a more complex mechanism (the pipeline and output formatters, instead of write-host the equivalent of `print()`) – TessellatingHeckler Dec 15 '17 at 09:29
  • 1
    @TessellatingHeckler Thank you for that. Updated the sample script. Also wrote them into .py and .ps1 files and ran both with: C:\temp> $time1 = get-date >> python.exe .\mary.py >> $time2 = get-date >> New-TimeSpan -Start $time1 -End $time2. Python finished in 49.7 seconds, and I got tired of waiting for PS... but yeah, after more and more use, I'm convinced that for any data-related activities, use Python, and for operations, use PowerShell. Both are great and easy-to-use languages, so love them both. – Alex Dec 16 '17 at 22:26
  • 3
    The only reason I found your comment two months later was because I was chasing a PowerShell performance annoyance I wanted to improve, so I partly agree. Even so, if you comment out the print lines, both scripts run in 0.6 seconds. With print, the Python runs in 46 seconds, and with write-host, PowerShell runs in 10 minutes(!). But after switching it to `[console]::writeline($string)`, it takes only 37 seconds. `write-host` is a wrapper for Write-Information, an output stream Python doesn't have, to make up for all the scripts that `write-host` and then can't run remotely. – TessellatingHeckler Dec 17 '17 at 10:28