
I have some CSV data I need to clean up by removing inline linefeeds and special characters like typographic quotes. I feel like I could get this working with Python or Unix utils, but I'm stuck on a pretty vanilla Windows 2012 box, so I'm giving PowerShell v5 a shot despite my lack of experience with it.

Here's what I'm looking to achieve:

$InputFile:

"INCIDENT_NUMBER","FIRST_NAME","LAST_NAME","DESCRIPTION"{CRLF}
"00020306","John","Davis","Employee was not dressed appropriately."{CRLF}
"00020307","Brad","Miller","Employee told customer, ""Go shop somewhere else!"""{CRLF}
"00020308","Ted","Jones","Employee told supervisor, “That’s not my job”"{CRLF}
"00020309","Bob","Meyers","Employee did the following:{LF}
• Showed up late{LF}
• Did not complete assignments{LF}
• Left work early"{CRLF}
"00020310","John","Davis","Employee was not dressed appropriately."{CRLF}

$OutputFile:

"INCIDENT_NUMBER","FIRST_NAME","LAST_NAME","DESCRIPTION"{CRLF}
"00020307","Brad","Miller","Employee told customer, ""Go shop somewhere else!"""{CRLF}
"00020308","Ted","Jones","Employee told supervisor, ""That's not my job"""{CRLF}
"00020309","Bob","Meyers","Employee did the following: * Showed up late * Did not complete assignments * Left work early"{CRLF}
"00020310","John","Davis","Employee was not dressed appropriately."{CRLF}

The following code works:

(Get-Content $InputFile -Raw) `
    -replace '(?<!\x0d)\x0a',' ' `
    -replace "[‘’´]","'" `
    -replace '[“”]','""' `
    -replace "\xa0"," " `
    -replace '[•·]','*' | Set-Content $OutputFile -Encoding ASCII

However, the actual data I'm dealing with is a 4GB file with over a million lines. Get-Content -Raw runs out of memory. I tried Get-Content -ReadCount 10000, but that removes all linefeeds, presumably because it reads line-wise.
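
For reference, the -ReadCount attempt looked roughly like this (reconstructed from memory, so the details may be slightly off):

# Reads the file in batches of 10,000 lines. Get-Content splits the input at
# both CRLF and lone LF, so by the time a batch reaches the script block the
# in-field linefeeds are indistinguishable from real row breaks and the
# lookbehind regex has nothing left to match.
Get-Content $InputFile -ReadCount 10000 | ForEach-Object {
    $_ -replace '(?<!\x0d)\x0a',' ' `
        -replace "[‘’´]","'" `
        -replace '[“”]','""' `
        -replace "\xa0"," " `
        -replace '[•·]','*'
} | Set-Content $OutputFile -Encoding ASCII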

More Googling brought me to Import-Csv and an example along these lines:

Import-Csv $InputFile | ForEach {
    $_.notes = $_.notes -replace '(?<!\x0d)\x0a',' '
    $_
} | Export-Csv $OutputFile -NoTypeInformation -Encoding ASCII

but I don't appear to have a notes property on my objects:

Exception setting "notes": "The property 'notes' cannot be found on this object. Verify that the property exists and can be set."
At C:\convert.ps1:53 char:5
+     $_.notes= $_.notes -replace '(?<!\x0d)\x0a',' '
+     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [], SetValueInvocationException
    + FullyQualifiedErrorId : ExceptionWhenSetting

I found another example using the Value property, but I got the same error.

I tried running Get-Member on each object and it looks like it's assigning properties based on the header from the file, like I may be able to get it with $_.DESCRIPTION, but I don't know enough PowerShell to run the replacements on all of the properties :(
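
For what it's worth, this is roughly how I inspected the objects (just looking at the first record):

# Show the properties Import-Csv creates from the header row.
Import-Csv $InputFile |
    Select-Object -First 1 |
    Get-Member -MemberType NoteProperty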

Please help? Thanks!

Update:

I ended up giving up on PS and coding this in AutoIT. It's not great, and it will be more difficult to maintain, especially since there hasn't been a new release in 2.5 years. But it works, and it crunches the prod file in 4 minutes.

Unfortunately, I couldn't key on the LF easily either, so I ended up using the pattern ^"[^",] to decide where new rows start (the line starts with a quote and the second character is neither a quote nor a comma).

Here's the AutoIT code:

#include <FileConstants.au3>

If $CmdLine[0] <> 2 Then
   ConsoleWriteError("Error in parameters" & @CRLF)
   Exit 1
EndIf

Local Const $sInputFilePath = $CmdLine[1]
Local Const $sOutputFilePath = $CmdLine[2]

ConsoleWrite("Input file: " & $sInputFilePath & @CRLF)
ConsoleWrite("Output file: " & $sOutputFilePath & @CRLF)
ConsoleWrite("***** WARNING *****" & @CRLF)
ConsoleWrite($sOutputFilePath & " is being OVERWRITTEN!" & @CRLF & @CRLF)

Local $bFirstLine = True

Local $hInputFile = FileOpen($sInputFilePath, $FO_ANSI)
If $hInputFile = -1 Then
   ConsoleWriteError("An error occurred when reading the file.")
   Exit 1
EndIf

Local $hOutputFile = FileOpen($sOutputFilePath, $FO_OVERWRITE + $FO_ANSI)
If $hOutputFile = -1 Then
   ConsoleWriteError("An error occurred when opening the output file.")
   Exit 1
EndIf

ConsoleWrite("Processing..." &@CRLF)

While True
   $sLine = FileReadLine($hInputFile)
   If @error = -1 Then ExitLoop

   ;Replace typographic single quotes and backtick with apostrophe
   $sLine = StringRegExpReplace($sLine, "[‘’´]","'")

   ;Replace typographic double quotes with normal quote (doubled for in-field CSV)
   $sLine = StringRegExpReplace($sLine, '[“”]','""')

   ;Replace bullet and middot with asterisk
   $sLine = StringRegExpReplace($sLine, '[•·]','*')

   ;Replace non-breaking space (0xA0) and delete (0x7F) with space
   $sLine = StringRegExpReplace($sLine, "[\xa0\x7f]"," ")

   If $bFirstLine = False Then
      If StringRegExp($sLine,'^"[^",]') Then
         $sLine = @CRLF & $sLine
      Else
         $sLine = " " & $sLine
      EndIf
   Else
      $bFirstLine = False
   EndIf

   FileWrite($hOutputFile, $sLine)

WEnd

ConsoleWrite("Done!" &@CRLF)
FileClose($hInputFile)
FileClose($hOutputFile)
  • Just replace `$_.notes` with `$_.DESCRIPTION`, that should do the trick – Mathias R. Jessen Sep 29 '20 at 15:40
  • Thanks! The problem is that real data contains 600 fields per line, hundreds of which could potentially contain line breaks and special characters. – Joseph Smith Sep 29 '20 at 15:46
  • See this: [Reading large text files with Powershell](http://www.happysysadm.com/2014/10/reading-large-text-files-with-powershell.html), and you can replace all those -replace with a succinct regex or pattern. For example "'(?<!\x0d)\x0a',' '|'[‘’´]',''|'[“”]',''|'\xa0',' '|'[•·]','*'" – postanote Sep 29 '20 at 15:58
  • @postanote Thanks for the link! That's actually where I found the -ReadCount option that removed all the line breaks. – Joseph Smith Sep 29 '20 at 16:06

4 Answers


Note:

  • See my other answer for a robust solution.

  • The answer below may still be of interest as a general line-by-line processing solution that performs well, although it invariably treats LF-only newlines as line separators too (it has been updated to use the same regex you use in the AutoIt solution you've added to the question to distinguish between a line that starts a row and one that continues it).


Given the size of your file, I suggest sticking with plain-text processing for performance reasons:

  • The switch statement enables fast line-by-line processing; it recognizes both CRLF and LF as newlines, as PowerShell generally does. Note, however, that because each line returned has its trailing newline stripped, you won't be able to tell whether the input line ended in CRLF or just LF.

  • Using a .NET type directly, System.IO.StreamWriter, bypasses the pipeline and enables fast writes to the output file.

  • For general PowerShell performance tips, see this answer.

$inputFile = 'in.csv'
$outputFile = 'out.csv'

# Create a stream writer for the output file.
# Default to BOM-less UTF-8, but you can pass a [System.Text.Encoding]
# instance as the second argument.
# Note: Pass a *full* path, because .NET's working dir. usually differs from PowerShell's
$outFileWriter = [System.IO.StreamWriter]::new("$PWD/$outputFile")

# Use a `switch` statement to read the input file line by line.
$outLine = ''
switch -File $inputFile -Regex {
  '^"[^",]' { # (Start of) a new row.
    if ($outLine) { # write previous, potentially synthesized line
      $outFileWriter.WriteLine($outLine)
    }
    $outLine = $_ -replace "[‘’´]", "'" -replace '[“”]', '""' -replace '\u00a0', ' '
  }
  default { # Continuation of a row.
    $outLine += ' ' + $_ -replace "[‘’´]", "'" -replace '[“”]', '""' -replace '\u00a0', ' ' `
      -replace '[•·]', '*' -replace '\n'
  }
}
# Write the last line.
$outFileWriter.WriteLine($outLine)

$outFileWriter.Close()

Note: The above assumes that no row continuation also matches regex pattern '^"[^",]', which is hopefully robust enough (you've deemed it to be, given that you based your AutoIt solution on it).

This simple distinction between the start of a row and its continuation lines obviates the need for lower-level file I/O to distinguish between CRLF and LF newlines, which is the approach my other answer takes.

mklement0
  • I finally got some time to test this today. On my 300-line test data (real data is over 1M lines), first method took 0.16 seconds, second took 4 seconds. Ran the first against the live data in 9.5 minutes. Tried the second one, but I killed it after 3.5 hours. Now that I'm running the numbers though, I bet it was damn close to done :( However, the resulting data has 16000 fewer lines than expected, and I'm really not sure why. Unfortunately, while I really, really appreciate your help, and I have learned quite a bit from your code, I have to move on. Thank you so much. – Joseph Smith Oct 06 '20 at 19:13
  • @JosephSmith: Thanks for the feedback re the performance of the `Invoke-Expression` solution, which led me to _remove_ the solution from the answer. I've updated the original solution to match the AutoIt solution you added to your question (which was a simple tweak, thanks to `switch -Regex`). However, I've since also posted a [fully robust](https://stackoverflow.com/a/64282298/45375) solution. – mklement0 Oct 09 '20 at 15:42

The first answer may be better than this one, as I'm not sure whether PS needs to load everything into memory this way (though I think it does), but going off what you started above, I was thinking along these lines...

# Import CSV into a variable
$InputFile = Import-Csv $InputFilePath

# Gets all field names, stores in $Fields
$InputFile | Get-Member -MemberType NoteProperty | 
Select-Object Name | Set-Variable Fields

# Updates each field entry
$InputFile | ForEach-Object {
    $thisLine = $_
    $Fields | ForEach-Object {
            ($thisLine).($_.Name) = ($thisLine).($_.Name) `
                -replace '(?<!\x0d)\x0a',' ' `
                -replace "[‘’´]","'" `
                -replace '[“”]','""' `
                -replace "\xa0"," " `
                -replace '[•·]','*'
            }
    $thisLine | Export-Csv $OutputFile -NoTypeInformation -Encoding ASCII -Append
} 
b-frid
  • Thanks! This works! Unfortunately, it's also slower than the other methods by a significant margin. I have a 300-line sample that I've been testing with. The other methods have run through it in a few seconds, but this one took 3.5 minutes for some reason. – Joseph Smith Sep 29 '20 at 17:09
  • Yep too memory intensive, as I suspected. Does it help if we run it appending the updates line by line instead? (See edited answer above.) – b-frid Sep 29 '20 at 17:23
  • All you did was move the brace to include the Export in the ForEach field, right? That took a little longer actually, 3.75s. :( – Joseph Smith Sep 29 '20 at 17:34
  • And added the `-Append` flag to the `Export-CSV` call. Hmm... – b-frid Sep 29 '20 at 17:38
  • Well, unless I'm mistaken this way at least shouldn't fill up memory usage, so should still be feasible on the large data set (unlike perhaps the first approach that only uses one Export call). Seems like it should go faster though, so not sure about that. (When you said "3.75s" above you meant 3.75m, right?) Anyway if I'm right about that, it should complete, albeit after what will probably be several hours, by the sounds of it. – b-frid Sep 29 '20 at 17:40
  • Yeah, based on a rough byte calculation, it would take 210 hours :( – Joseph Smith Sep 29 '20 at 17:53
  • If you take out the regex statements completely (i.e. remove `$Fields | ForEach-Object { ($thisLine).($_.Name) = ($thisLine).($_.Name) -replace '(?<!\x0d)\x0a',' ' -replace "[‘’´]","'" -replace '[“”]','""' -replace "\xa0"," " -replace '[•·]','*'}`) and then run it (so it should just create an exact copy), does it still take that long? – b-frid Sep 29 '20 at 17:56
  • 654ms without the regex, as opposed to 215685ms with it. And for the record, I was only running the LF regex, not the others. Each line has 580 fields in case you missed that comment elsewhere. – Joseph Smith Sep 29 '20 at 18:06
  • @b-frid: The use of cmdlets that serialize to and from custom objects and the use of the pipeline, especially with `ForEach-Object`, make your approach incurably slow - which is unfortunate, because it makes for a conceptually more elegant solution. See [this answer](https://stackoverflow.com/a/54753717/45375) for PowerShell performance tips. As an aside: `-replace '\n', ' '` will do - there are no CRs inside the fields. – mklement0 Sep 29 '20 at 18:33
  • @mklement0: Based on Joseph's last comment, isn't the bulk of the time required for the `-replace` calls? – b-frid Sep 29 '20 at 18:46
  • @b-frid: It's primarily the nested pipeline for the 580 fields and that a script block must be executed for each field, not so much the `-replace` operations themselves. You can somewhat speed that up by using the `.ForEach()` _method_ instead of the `ForEach-Object` _cmdlet_, but overall such an object-based solution will be many times slower than a non-pipeline, plain-text solution. – mklement0 Sep 29 '20 at 18:51

Here's another "line-by-line" attempt, somewhat akin to mklement0's answer. It assumes that no "row-continuation" line will begin with a double quote ("). Hopefully it performs much better!

# Clear contents of file (Not sure if you need/want this...)
if (Test-Path -type leaf $OutputFile) { Clear-Content $OutputFile }

# Flag for first entry, since no data manipulation needed there
$firstEntry = $true

foreach($line in [System.IO.File]::ReadLines($InputFile)) {
    if ($firstEntry) {
        Add-Content -Path $OutputFile -Value $line -NoNewline
        $firstEntry = $false
    }
    else {
        if ($line[0] -eq '"') { Add-Content -Path $OutputFile "`r`n" -NoNewline}
        else { Add-Content -Path $OutputFile " " -NoNewline}
        $sanitizedLine = $line -replace '(?<!\x0d)\x0a',' ' `
                               -replace "[‘’´]","'" `
                               -replace '[“”]','""' `
                               -replace "\xa0"," " `
                               -replace '[•·]','*'
        Add-Content -Path $OutputFile -Value $sanitizedLine -NoNewline
    }
}

The technique is based on this other answer and its comments: https://stackoverflow.com/a/47146987/7649168

(Also thanks to mklement0 for explaining the performance issues of my previous answer.)

b-frid
  • In terms of performance, `[System.IO.File]::ReadLines()` is on par with a `switch` statement, though the latter has the advantage of built-in support for multi-branch conditionals (equality matching (default), wildcard matching (`-Wildcard`), regex matching (`-Regex`)). However, use of `Add-Content` is another bottleneck, not least because the file is opened and closed for every call. A `[System.IO.StreamWriter]` instance, as shown in my answer, solves that problem. – mklement0 Sep 30 '20 at 14:42
  • I was able to get back to testing today. Unfortunately, while this solution is much faster than the previous, it's still not viable for the production data. The 300-line sample, representing about 0.03% of the live data, took 10 seconds with this method, so over 8 hours for the live data. I really appreciate the help though! – Joseph Smith Oct 06 '20 at 19:07

The following two approaches would work in principle, but are too slow with a large input file such as yours.

  • Object-oriented processing with Import-Csv / Export-Csv:

    • Use Import-Csv to parse the CSV into objects, modify the objects' DESCRIPTION property values, then reexport with Export-Csv. Since the row-internal LF-only newlines are inside double-quoted fields, they are recognized as being part of the same row.

    • While a robust and conceptually elegant approach, it is by far the slowest and also very memory-intensive - see GitHub issue #7603, which discusses the reasons, and GitHub feature request #11027 to improve the situation by outputting hashtables rather than custom objects ([pscustomobject]).

  • Plain-text processing with Get-Content / Set-Content:

    • Use Get-Content -Delimiter "`r`n" to split the text file into lines by CRLF only, not also LF, transform each line as needed, and save it to the output file with Set-Content (a minimal sketch of this variant follows this list).

    • While you pay a performance penalty for the conceptual elegance of using the pipeline in general, which makes saving the results with Set-Content line by line somewhat slow, Get-Content is especially slow, because it decorates each output string (line) with additional properties about the originating file, which is costly. See the green-lighted, but not yet implemented GitHub feature request #7537 to improve performance (and memory use) by omitting this decoration.
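
For reference, a minimal sketch of the Get-Content -Delimiter variant described above (workable in principle, but, as noted, too slow for an input of this size):

# Sketch only - too slow for a 4GB input, as discussed above.
# -Delimiter "`r`n" makes Get-Content split on CRLF only, so the LF-only
# newlines inside quoted fields stay within each emitted string.
Get-Content $InputFile -Delimiter "`r`n" | ForEach-Object {
  # Get-Content may retain the delimiter at the end of each string; remove it.
  $row = $_.TrimEnd("`r", "`n")
  $row -replace '\n', ' ' -replace "[‘’´]", "'" -replace '[“”]', '""' -replace '\u00a0', ' ' -replace '[•·]', '*'
} | Set-Content $OutputFile -Encoding ASCII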


Solution:

  • For performance reasons, direct use of .NET APIs is therefore required.
    • Note: If the PowerShell solution should still be too slow, consider creating a helper class via ad-hoc compilation of C# code using Add-Type (a rough, illustrative sketch follows this list); ultimately, of course, using only compiled code will perform best.
  • While there is no direct equivalent to Get-Content -Delimiter "`r`n", you can read text files in fixed-size blocks (arrays) of characters, using the System.IO.StreamReader.ReadBlock() method (.NET Framework 4.5+ / .NET Core 1+), on which you can then perform the desired transformations, as shown below.
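
As an aside, here is a rough, untested sketch of what the Add-Type route might look like; the CsvCleaner class and Clean method names are made up for illustration, and the substitutions mirror the ones in the solution below:

# Hypothetical helper class, compiled ad hoc; adjust paths and block size as needed.
Add-Type -TypeDefinition @'
using System.IO;
using System.Text.RegularExpressions;

public static class CsvCleaner
{
    public static void Clean(string inPath, string outPath)
    {
        using (var reader = new StreamReader(inPath))
        using (var writer = new StreamWriter(outPath))
        {
            var buf = new char[100 * 1024 * 1024]; // 100 MB blocks
            int charsRead;
            while ((charsRead = reader.ReadBlock(buf, 0, buf.Length)) > 0)
            {
                string block = new string(buf, 0, charsRead);
                // Same transformations as in the PowerShell solution below.
                block = Regex.Replace(block, "(?<!\r|^)\n", " ");
                block = Regex.Replace(block, "[‘’´]", "'");
                block = Regex.Replace(block, "[“”]", "\"\"");
                block = Regex.Replace(block, "\u00a0", " ");
                block = Regex.Replace(block, "[•·]", "*");
                writer.Write(block);
            }
        }
    }
}
'@

[CsvCleaner]::Clean("$PWD/in.csv", "$PWD/out.csv")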

Note:

  • For best performance, choose a high $BUFSIZE value below to minimize the number of reads and processing iterations; obviously, the value must be chosen so that you don't run out of memory.

  • There's not even a need to split the blocks read into CRLF-separated lines, because you can simply target the LF-only newlines with a regex that is a modified version of the one from your original approach, '(?<!\r|^)\n' (see the code comments below).

  • For brevity, error handling is omitted, but the .Close() calls to close the files should generally be placed in the finally block of a try / catch / finally statement (a rough sketch follows the code below).

# In- and output file paths.
# Note: Be sure to use *full* paths, because .NET's working dir. usually
#       differs from PowerShell's.
$inFile = "$PWD/in.csv"
$outFile = "$PWD/out.csv"

# How many characters to read at once.
# This is a tradeoff between execution speed and memory use.
$BUFSIZE = 100MB
$buf = [char[]]::new($BUFSIZE)

$inStream = [IO.StreamReader]::new($inFile)
$outStream = [IO.StreamWriter]::new($outFile)
  
# Process the file in fixed-size blocks of characters.
while ($charsRead = $inStream.ReadBlock($buf, 0, $BUFSIZE)) {
  # Convert the array of chars. to a string.
  $block = [string]::new($buf, 0, $charsRead)
  # Transform the block and write it to the output file.
  $outStream.Write(
    # Transform block-internal LF-only newlines to spaces and perform other
    # substitutions.
    # Note: The |^ part inside the negative lookbehind is to deal with the
    #       case where the block starts with "`n" due to the block boundaries
    #       accidentally having split a CRLF sequence.
    ($block -replace '(?<!\r|^)\n', ' ' -replace "[‘’´]", "'" -replace '[“”]', '""' -replace '\u00a0', ' ' -replace '[•·]', '*')
  )
}

$inStream.Close()
$outStream.Close()
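
For completeness, the try / finally wrapping mentioned in the notes above might look roughly like this (sketch only; the loop body is the same as above):

$inStream = $outStream = $null
try {
  $inStream  = [IO.StreamReader]::new($inFile)
  $outStream = [IO.StreamWriter]::new($outFile)
  while ($charsRead = $inStream.ReadBlock($buf, 0, $BUFSIZE)) {
    # ... same transformation and .Write() call as above ...
  }
}
finally {
  # Runs even if an exception occurs, so the file handles are always released.
  if ($inStream)  { $inStream.Close() }
  if ($outStream) { $outStream.Close() }
}
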
mklement0