I have some CSV data I need to clean up by removing inline linefeeds and special characters like typographic quotes. I feel like I could get this working with Python or Unix utils, but I'm stuck on a pretty vanilla Windows 2012 box, so I'm giving PowerShell v5 a shot despite my lack of experience with it.
Here's what I'm looking to achieve:
$InputFile
:
"INCIDENT_NUMBER","FIRST_NAME","LAST_NAME","DESCRIPTION"{CRLF}
"00020306","John","Davis","Employee was not dressed appropriately."{CRLF}
"00020307","Brad","Miller","Employee told customer, ""Go shop somewhere else!"""{CRLF}
"00020308","Ted","Jones","Employee told supervisor, “That’s not my job”"{CRLF}
"00020309","Bob","Meyers","Employee did the following:{LF}
• Showed up late{LF}
• Did not complete assignments{LF}
• Left work early"{CRLF}
"00020310","John","Davis","Employee was not dressed appropriately."{CRLF}
$OutputFile
:
"INCIDENT_NUMBER","FIRST_NAME","LAST_NAME","DESCRIPTION"{CRLF}
"00020307","Brad","Miller","Employee told customer, ""Go shop somewhere else!"""{CRLF}
"00020308","Ted","Jones","Employee told supervisor, ""That's not my job"""{CRLF}
"00020309","Bob","Meyers","Employee did the following: * Showed up late * Did not complete assignments * Left work early"{CRLF}
"00020310","John","Davis","Employee was not dressed appropriately."{CRLF}
The following code works:
(Get-Content $InputFile -Raw) `
-replace '(?<!\x0d)\x0a',' ' `
-replace "[‘’´]","'" `
-replace '[“”]','""' `
-replace "\xa0"," " `
-replace '[•·]','*' | Set-Content $OutputFile -Encoding ASCII
However, the actual data I'm dealing with is a 4GB file with over a million lines. Get-Content -Raw
runs out of memory. I tried Get-Content -ReadCount 10000
, but that removes all linefeeds, presumably because it reads line-wise.
More Googling brought me to Import-Csv which I got from here:
Import-Csv $InputFile | ForEach {
$_.notes = $_.notes -replace '(?<!\x0d)\x0a',' '
$_
} | Export-Csv $OutputFile -NoTypeInformation -Encoding ASCII
but I don't appear to have a notes property on my objects:
Exception setting "notes": "The property 'notes' cannot be found on this object. Verify that the property exists and can be set."
At C:\convert.ps1:53 char:5
+ $_.notes= $_.notes -replace '(?<!\x0d)\x0a',' '
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], SetValueInvocationException
+ FullyQualifiedErrorId : ExceptionWhenSetting
I found another example using the Value property, but I got the same error.
I tried running Get-Member
on each object and it looks like it's assigning properties based on the header from the file, like I may be able to get it with $_.DESCRIPTION
, but I don't know enough PowerShell to run the replacements on all of the properties :(
Please help? Thanks!
Update:
I ended up giving up on PS and coding this in AutoIT. It's not great, and it will be more difficult to maintain, especially since there hasn't been a new release in 2.5 years. But it works, and it crunches the prod file in 4 minutes.
Unfortunately, I couldn't key on the LF easily either, so I ended up going with the logic to create new lines based on ^"[^",]
(Line starts with a quote and the second character is not a quote or comma).
Here's the AutoIT code:
#include <FileConstants.au3>
If $CmdLine[0] <> 2 Then
ConsoleWriteError("Error in parameters" & @CRLF)
Exit 1
EndIf
Local Const $sInputFilePath = $CmdLine[1]
Local Const $sOutputFilePath = $CmdLine[2]
ConsoleWrite("Input file: " & $sInputFilePath & @CRLF)
ConsoleWrite("Output file: " & $sOutputFilePath & @CRLF)
ConsoleWrite("***** WARNING *****" & @CRLF)
ConsoleWrite($sOutputFilePath & " is being OVERWRITTEN!" & @CRLF & @CRLF)
Local $bFirstLine = True
Local $hInputFile = FileOpen($sInputFilePath, $FO_ANSI)
If $hInputFile = -1 Then
ConsoleWriteError("An error occurred when reading the file.")
Exit 1
EndIf
Local $hOutputFile = FileOpen($sOutputFilePath, $FO_OVERWRITE + $FO_ANSI)
If $hOutputFile = -1 Then
ConsoleWriteError"An error occurred when opening the output file.")
Exit 1
EndIf
ConsoleWrite("Processing..." &@CRLF)
While True
$sLine = FileReadLine($hInputFile)
If @error = -1 Then ExitLoop
;Replace typographic single quotes and backtick with apostrophe
$sLine = StringRegExpReplace($sLine, "[‘’´]","'")
;Replace typographic double quotes with normal quote (doubled for in-field CSV)
$sLine = StringRegExpReplace($sLine, '[“”]','""')
;Replace bullet and middot with asterisk
$sLine = StringRegExpReplace($sLine, '[•·]','*')
;Replace non-breaking space (0xA0) and delete (0x7F) with space
$sLine = StringRegExpReplace($sLine, "[\xa0\x7f]"," ")
If $bFirstLine = False Then
If StringRegExp($sLine,'^"[^",]') Then
$sLine = @CRLF & $sLine
Else
$sLine = " " & $sLine
EndIf
Else
$bFirstLine = False
EndIf
FileWrite($hOutputFile, $sLine)
WEnd
ConsoleWrite("Done!" &@CRLF)
FileClose($hInputFile)
FileClose($hOutputFile)