
I have a CSV file that originates from a Linux server, so it contains \n to mark the end of each line. I process this file through a PowerShell script and replace its text qualifier with something else (I do this because I'm using SSIS to upload the CSV feed to a database, and for some odd reason SSIS doesn't support embedded text qualifiers).

The part of the script which does this replacement looks like this:

gc $file.FullName |
    % { if($_.indexOf("|~|") -eq -1) {$_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""} else {$_ -replace " ", " "}} |
    sc  $temppath

This script works fine, but it also changes the line feed at the end of each line to \r\n. I thought that shouldn't be a big problem, until I realized that my original feed also contains occasional \r characters in the description column, and those are also getting replaced with \r\n. Now the SSIS package is unable to identify where a CSV line ends.
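To illustrate (the file name here is just for the demo): a lone \r inside a quoted field is treated as a line break by Get-Content, so one CSV record comes back as two lines:

```powershell
# Illustration only: a record whose quoted field contains a lone CR ("`r")
[System.IO.File]::WriteAllText("demo.csv", "a,`"x`ry`",b`n")

# Get-Content splits on CR, LF, and CRLF, so this one record becomes two lines
(Get-Content "demo.csv").Count
```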

I searched and found this is due to Get-Content, which works line by line, so I changed the command to the following:

[System.IO.File]::ReadAllText($file.FullName) |
            % { if($_.indexOf("|~|") -eq -1) {$_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""} else {$_ -replace " ", " "}} |
            sc  $temppath

That seems to solve my issue, but now I am trapped with an *OutOfMemoryException*, as some of the CSV files are big (about 400-500 MB).

Any suggestions on what I can possibly do? Perhaps some replacement for ReadAllText() that works for big files?

Ankit
  • It's not a programming answer but if you are just having issues with Carriage returns and Line feeds could you just use http://sourceforge.net/projects/dos2unix/ to work with the line breaks for you. – Matt Jul 22 '14 at 11:22
  • I reread your post and realize why that won't work. – Matt Jul 22 '14 at 11:27
  • You could convert the file before and after processing? – Matt Jul 22 '14 at 11:42
  • @Matt I am sorry but I didn't understand what you meant by convert. Can you please explain? I would certainly love to sort this out with whatever it takes. – Ankit Jul 22 '14 at 11:44
  • Since ReadAllText() is already in the .NET domain, you may want to dive even deeper: http://stackoverflow.com/questions/5943086/c-sharp-very-large-string-manipulation-out-of-memory-exception – Alexander Obersht Jul 22 '14 at 14:28

1 Answer


The problem is that by piping the output of Get-Content, you're ripping the file into individual lines, and then Set-Content combines those lines into a new file. Since CR/LF is how lines are separated in Windows, that's what PowerShell cmdlets use to join lines into a file (you get the same behavior with Out-File, not surprisingly). But you already knew that. Now what's the solution?
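(For a quick repro of that joining behavior, with illustrative file paths:)

```powershell
# Write a file with LF-only line endings
[System.IO.File]::WriteAllText("$env:TEMP\lf.txt", "a`nb`n")

# Round-trip it through the Get-Content | Set-Content pipeline
Get-Content "$env:TEMP\lf.txt" | Set-Content "$env:TEMP\crlf.txt"

# In Windows PowerShell the rewritten file now separates lines with CRLF
[System.IO.File]::ReadAllText("$env:TEMP\crlf.txt")
```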

One way is to use the -join operator to join all the lines into a single string consisting of the lines separated by LF characters, and pipe that string to Set-Content:

(Get-Content $file | %{
  if ($_.indexOf("|~|") -eq -1) {
     $_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""
  } else {
    $_ -replace " ", " "
  }
}) -join "`n" | Set-Content  $temppath

Although this joins all the lines into a single string that needs to be stored in memory, I strongly suspect this will work better for you, because an OutOfMemoryException with 500 MB files is unlikely to be a limitation of system resources; it's probably a limitation of the .NET class.

However, if that still gives you memory errors, or it works but is too taxing on system resources, you can append the lines to the file one at a time, in each iteration of the pipeline, using System.IO.File's AppendAllText method. That appends each line without adding a newline (which piping to Out-File -Append would do), so you tack a "`n" onto each one:

Get-Content $file | %{
  $line = if ($_.IndexOf("|~|") -eq -1) {
    $_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""
  } else {
    $_ -replace " ", " "
  }
  [System.IO.File]::AppendAllText($temppath, $line + "`n")
}

This would be slower, but drastically less memory-intensive.

Note, BTW, that gc $file.FullName is redundant, because a FileInfo object is implicitly converted to a string as its FullName property, so gc $file is sufficient.

Adi Inbar
  • Thanks for the detailed answer. I am going to try these out and come back. But line-by-line operation with Get-Content had one added problem. As I specified in the question, my feed occasionally contains '\r' in the description column, which should not be treated as a line ending. But Get-Content does. So they are also replaced, but I want them to stay as they are. AppendAllText() was doing that just fine. – Ankit Jul 23 '14 at 05:43
  • Oh, I missed the part about CR by itself being replaced by CRLF. It appears that **Get-Content** by default splits on CR, LF, and CRLF. Try `Get-Content -Delimiter "`n" $file`, then *don't* add a `"`n"` to each line (so in the first option, use `-join ''`, and in the second one, omit the `+ "`n"`). – Adi Inbar Jul 23 '14 at 08:08
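A sketch of that last suggestion, combined with the append approach from the answer (untested here, and relying on Windows PowerShell keeping the trailing delimiter on each chunk that -Delimiter returns):

```powershell
# Stream the file in LF-delimited chunks; lone CRs inside fields are preserved
# because splitting happens only on "`n". Each chunk keeps its trailing "`n",
# so nothing needs to be re-added. Assumes $temppath does not already exist.
Get-Content -Delimiter "`n" $file | %{
    $out = if ($_.IndexOf("|~|") -eq -1) {
        $_ -replace "`"((?:`"`"|.)*?)`"(?!`")", "|~|`$1|~|" -replace "`"`"", "`""
    } else {
        $_
    }
    [System.IO.File]::AppendAllText($temppath, $out)
}
```

Only one chunk is in memory at a time, so this should stay well clear of the OutOfMemoryException even on the 400-500 MB files.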