I have a working script whose objective is to parse data files for malformed rows before importing them into Oracle. Processing a 450 MB CSV file with more than 1 million rows and 8 columns takes a little over 2.5 hours and maxes out a single CPU core. Small files complete quickly, within seconds.
Oddly, a 350 MB file with a similar number of rows but 40 columns takes only 30 minutes.
My issue is that the files will grow over time, and tying up a CPU core for 2.5 hours isn't good. Can anyone recommend a code optimisation? A similarly titled post recommended local paths, which I'm already doing.
$file = "\Your.csv"
$path = "C:\Folder"
$csv = Get-Content "$path$file"
# Count the number of columns in the header row
$count = ($csv[0] -split ',').count
# https://blogs.technet.microsoft.com/gbordier/2009/05/05/powershell-and-writing-files-how-fast-can-you-write-to-a-file/
$stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
$stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"
# Two validation steps: (1) the row must have at least as many fields as the header row; (2) split the row after the first column, and the remaining columns must total at least 40 characters.
$csv | Select -Skip 1 | % {
    if (($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
        $stream1.WriteLine($_)
    } else {
        $stream2.WriteLine($_)
    }
}
$stream1.close()
$stream2.close()
Sample Data File:
C1,C2,C3,C4,C5,C6,C7,C8
ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10
ABC,000000000000006732,1110,2016-06-22,0,P,ESTIMATE,2015473497A10
ABC,,2016-06-22,,201501
,,,,,,,,
ABC,000000000000006732,1135,2016-08-28,0,P,ESTIMATE,2015473497B10
ABC,000000000000006732,1167,2015-12-20,0,P,ESTIMATE,2015473497B10
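To make the two checks concrete, here is a quick illustrative sketch (not part of the original script) applied to the first sample data row above; the second expression measures everything after the first column, which is what $_.split(',',2)[1] returns.

# Illustration only: apply the two validation checks to the first sample data row.
$row = 'ABC,000000000000006732,1063,2016-02-20,0,P,ESTIMATE,2015473497A10'
($row -split ',').Count          # 8  -> not less than the header count of 8, so check 1 passes
($row.Split(',', 2)[1]).Length   # 61 -> the text after the first column is at least 40 characters, so check 2 passes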
Have you put Measure-Command { } blocks around each section of your code to see where the delays are? i.e.: is it the loading, the file writing, etc.
– Simon Catlin Jan 27 '17 at 10:47
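A minimal sketch of the timing approach suggested in the comment above, assuming Measure-Command is the intended tool and reusing the question's paths and logic unchanged; each block returns a TimeSpan, which should show whether the Get-Content load or the per-row loop dominates the runtime.

# Hypothetical timing harness based on the comment above; same paths and logic as the question.
$file = "\Your.csv"
$path = "C:\Folder"

# Time the file load separately from the row processing. The script: scope ensures the
# variable assigned inside the Measure-Command script block is visible afterwards.
$loadTime = Measure-Command {
    $script:csv = Get-Content "$path$file"
}

$processTime = Measure-Command {
    $count = ($script:csv[0] -split ',').count
    $stream1 = [System.IO.StreamWriter] "$path\Passed$file-Pass.txt"
    $stream2 = [System.IO.StreamWriter] "$path\Failed$file-Fail.txt"
    $script:csv | Select -Skip 1 | % {
        if (($_ -split ',').count -ge $count -And ($_.split(',',2)[1]).Length -ge 40) {
            $stream1.WriteLine($_)
        } else {
            $stream2.WriteLine($_)
        }
    }
    $stream1.close()
    $stream2.close()
}

"Load:    {0:n1} s" -f $loadTime.TotalSeconds
"Process: {0:n1} s" -f $processTime.TotalSeconds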