
I have a big file (>1.5 GB) that uses '#@#@#' as its row delimiter. I will replace this delimiter with a CRLF sequence before I process the file through Informatica. The issue is that the file also contains stray CR and LF characters, and I need to get rid of them before the replacement. I have found a couple of options to do that, but due to the size I get OutOfMemory exceptions.

param
(
  [string]$Source,
  [string]$Destination
)

echo $Source
echo $Destination

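# Note: $(Get-Content $Source) reads the entire file into memory at once,
# which is what triggers the OutOfMemory exceptions on a >1.5 GB file.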
$Writer = New-Object IO.StreamWriter $Destination
$Writer.Write( [String]::Join("", $(Get-Content $Source)) )
$Writer.Close()

My question is: is there any way to set my row delimiter to '#@#@#' and then read the file line by line to remove the CR and LF characters?

yasemin
  • Sounds like you need more memory, then. I know I can load a 1.5 GB file with 4 GB of RAM on my computer. Most of the utilities I have ever used to remove CRLF require loading the entire file into memory before they can remove it. Hopefully someone comes up with a better solution for you. – Squashman Dec 21 '16 at 16:34
  • Thank you Squashman. It is my last option to ask for extra memory. I will see if anyone comes up with a different solution. Thank you. – yasemin Dec 21 '16 at 16:48
  • Can you try not using the StreamWriter, and maybe not loading the file into memory at all? Something like: Get-Content "C:\source.txt" | Foreach-Object {$_.Replace(..) } | Set-Content C:\..Output.txt – Prageeth Saravanan Dec 21 '16 at 17:29
  • The input file contains CR and LF characters, but can they also occur as a CR+LF sequence? What do you want CR and LF to be replaced with? – aschipfl Dec 21 '16 at 18:14
  • Yes, it is possible that we get CR+LF as well. I want to replace them with a space only. – yasemin Dec 21 '16 at 20:11
  • This sounds like a job for [sed or tr, etc](http://stackoverflow.com/questions/10748453/replace-comma-with-newline-in-sed) and [Enable true linux shell on windows 10 anniversary edition](http://www.developerinsider.in/step-vise-guide-to-enable-windows-10s-ubuntu-bash-shell-windows-subsystem-for-linux/) – Kraang Prime Dec 21 '16 at 22:56
  • @yasemin, so I understand every single CR and every single LF are to be replaced by a single space; what about consecutive sequences like CR+LF, CR+CR, LF+CR+LF, ...? Replace them with a single space? How long are the lines of the original input file; in other words, how many characters at most lie between two LF bytes? I am asking in order to evaluate whether it is possible to provide a pure batch-file solution, given the ~8190-byte line-length limit... – aschipfl Dec 22 '16 at 15:25
  • Each line is 1024 characters, which I believe is the maximum that can fit in the file. No CR or LF characters should exist in the file, but like I said, there are a couple of them. I am trying to get rid of all CR and LF characters (order not important) and then replace '#@#@#' with CRLF. The solution dbenham provided seems to be working; I have changed it a little to remove all occurrences of CR and LF, not only CRLF. I will let you know of the result. – yasemin Dec 22 '16 at 16:06

2 Answers


OK, my first attempt was unbearably slow. Here is a good solution that was able to process a 1.8 GB file in 2 min 48 sec :-)

I used hybrid batch/JScript, so it runs on any Windows machine from XP onward - no 3rd party exe file is needed, nor is any compilation needed.

I read and write ~1 MB chunks. The logic is actually pretty simple.

I replace all \r\n with a single space, and #@#@# with \r\n. You can easily change the string values in the code to suit your needs.

fixLines.bat

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::--- Batch section within JScript comment that calls the internal JScript ----
@echo off
setlocal disableDelayedExpansion

if "%~1" equ "" (
  echo Error: missing input argument
  exit /b 1
)
if "%~2" equ "" (
  set "out=%~f1.new"
) else (
  set "out=%~2"
)

<"%~1" >"%out%" cscript //nologo //E:JScript "%~f0"
if "%~2" equ "" move /y "%out%" "%~1" >nul

exit /b

----- End of JScript comment, beginning of normal JScript  ------------------*/
var delim='#@#@#',
    delimReplace='\r\n',
    nl='\r\n',
    nlReplace=' ',
    pos=0,
    str='';

var delimRegex=new RegExp(delim,"g"),
    nlRegex=new RegExp(nl,"g");

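// Read ~1 MB chunks. Everything up to the last complete delimiter in the
// buffer is transformed and written out; the tail after that delimiter is
// carried over and prepended to the next chunk, so a delimiter that spans
// a chunk boundary is still found.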
while( !WScript.StdIn.AtEndOfStream ) {
  str=str.substring(pos)+WScript.StdIn.Read(1000000);
  pos=str.lastIndexOf(delim);
  if (pos>=0) {
    pos+=delim.length;
    WScript.StdOut.Write(str.substring(0,pos).replace(nlRegex,nlReplace).replace(delimRegex,delimReplace));
  } else {
    pos=0;
  }
}
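// Write out whatever follows the last delimiter (the final record has no
// trailing delimiter, so it never triggers the write inside the loop).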
if (str.length>pos) WScript.StdOut.Write(str.substring(pos).replace(nlRegex,nlReplace));

To fix input.txt and write the output to output.txt:

fixLines input.txt output.txt

To overwrite the original file test.txt:

fixLines test.txt

Just for kicks, I attempted to process the 1.8 GB file using JREPL.BAT. I didn't think it would work, because it must load the entire file into memory. It doesn't matter how much memory is installed in the computer: JScript is limited to a maximum string size of 2 GB. And I think there are additional constraints that come into play.

jrepl "\r?\n:#@#@#" " :\r\n" /m /x /t : /f input.txt /o output.txt

It took 5 minutes for the command to fail with an "Out Of Memory" error. And then it took a long time for my computer to recover from the serious abuse of memory.

Below is my original custom batch/JScript solution that reads and writes one character at a time.

slow.bat

@if (@X)==(@Y) @end /* Harmless hybrid line that begins a JScript comment

::--- Batch section within JScript comment that calls the internal JScript ----
@echo off
setlocal disableDelayedExpansion

if "%~1" equ "" (
  echo Error: missing input argument
  exit /b 1
)
if "%~2" equ "" (
  set "out=%~f1.new"
) else (
  set "out=%~2"
)

<"%~1" >"%out%" cscript //nologo //E:JScript "%~f0"
if "%~2" equ "" move /y "%out%" "%~1" >nul

exit /b

----- End of JScript comment, beginning of normal JScript  ------------------*/
var delim='#@#@#',
    delimReplace='\r\n',
    nlReplace=' ',
    pos=0,
    chr;

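// Scan one character at a time; 'pos' tracks how many leading characters
// of the delimiter have been matched so far.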
while( !WScript.StdIn.AtEndOfStream ) {
  chr=WScript.StdIn.Read(1);
  if (chr==delim.charAt(pos)) {
    if (++pos==delim.length) {
      WScript.StdOut.Write(delimReplace);
      pos=0;
    }
  } else {
    if (pos) {
      WScript.StdOut.Write(delim.substring(0,pos));
      pos=0;
    }
    if (chr=='\n') {
      WScript.StdOut.Write(nlReplace);
    } else if (chr!='\r') {
      WScript.StdOut.Write(chr);
    }
  }
}
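// A partial delimiter match may still be pending at end of input; flush it.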
if (pos) WScript.StdOut.Write(delim.substring(0,pos));

It worked, but it was a dog. Here is a summary of timing results to process a 155 MB file:

slow.bat     3120 sec  (52 min)
jrepl.bat      55 sec
fixLines.bat   15 sec

I verified that all three solutions gave the same result.

dbenham
  • How long is that going to take on 1,500,000,000 bytes? This seems better suited for JREPL if she had enough ram in her computer. – Squashman Dec 21 '16 at 22:27
  • @Squashman - The size is the whole problem - JREPL would require `/M` option, which means the entire file must fit in memory, and I don't think JREPL can work with such a big file in memory. I imagine performance could be improved a bit by reading a block of data at a time (1mb maybe?), but you still would have to process char by char, and handle delims that span blocks. – dbenham Dec 21 '16 at 22:46
  • This looks like a lot of excessive file I/O, and seems prone to memory leaks. Why not try something simpler, like chunking the file into larger blocks, or a CSV parser with a custom line ending? Or maybe something like http://cc.davelozinski.com/c-sharp/the-fastest-way-to-read-and-process-text-files – Kraang Prime Dec 21 '16 at 22:47
  • @SamuelJackson - Memory leaks? I think not! But yes, reading byte by byte may slow it down a bit. I just acknowledged to Squashman that reading blocks might improve performance a bit. But I don't think the performance will suffer *that* badly. I'm not aware of any existing utilities (CSV or otherwise) that can efficiently read lines with arbitrary line terminators. Any solution will have to logically process byte by byte, but the IO could be buffered. – dbenham Dec 21 '16 at 22:59
  • @dbenham, pretty sure I have used your JREPL on file sizes bigger than 1GB with the /M option. But I also have 16GB of ram on my work computer but they only give me 4GB on the Virtual Servers I use for Production. – Squashman Dec 21 '16 at 23:14
  • It's not so much the reading byte by byte that would lead to memory leaks, but the combination of buffered data and multi-byte reverse bit shifts (for example, collapsing '#@#@#' to '\n' is a shift-left reduction of 5 bytes). This means that all bytes to the right need to move one position left. If reading the bytes and just writing them out, you can end up with overflow as the file gets larger and the bytes are appended to the end. – Kraang Prime Dec 22 '16 at 00:52
  • A faster, more memory-efficient way would be to read from the end of the file in reverse and insert into the new file (always inserting the char at the first position of the file), which would push the bytes to the right. – Kraang Prime Dec 22 '16 at 00:53
  • @SamuelJackson - My mistake, it was *that* bad times 10 LOL. I've posted a new solution that works pretty darn well. Neither solution has shifting bits. The before and after strings are independent in JavaScript. – dbenham Dec 22 '16 at 03:17
  • @Squashman - I've confirmed that JREPL fails with a 1.8 GB file. Also my original solution was a complete dog. But the new solution is pretty good! – dbenham Dec 22 '16 at 03:18
  • I realized you did not put a comma between nl and nlReplace. Is that on purpose? nl='\r\n' nlReplace=' ', – yasemin Dec 22 '16 at 16:47
  • @yasemin - No! Thanks for the heads up. I've edited the code. But I don't understand why it seems to work either way :-/ I guess I need to dive into the JScript spec a bit. – dbenham Dec 22 '16 at 16:58
  • Yes it does, surprisingly. By the way, the code worked like a charm :). I can't thank you enough. Kudos :) – yasemin Dec 22 '16 at 17:34
  • @yasemin - Don't forget to accept an answer if it fully answers your question to your satisfaction. – dbenham Dec 22 '16 at 18:27

Conceptually simple and memory-efficient, but slow PowerShell solution:

This PowerShell (v2+) solution is slow, but it is conceptually simple and you shouldn't run out of memory, because the input lines are processed one at a time, using #@#@# as the line delimiter.

Note: This solution combines your two steps:

  • It replaces the original line breaks with a single space each,

  • and it replaces each #@#@# sequence with a newline.

# Create sample input file.
@'
line 1 starts here
and
ends here#@#@#line 2 is all on one line#@#@#line 3 spans
two lines#@#@#
'@ > file

# Determine the input file.
$inFile = 'file'
# Create the output file.
$outFile = 'out'
$null = New-Item -Type File $outFile

$sep = '#@#@#'
Get-Content -Delimiter $sep $inFile | % {
  Add-Content -Value ($_.Replace("`r`n", " ").Replace($sep, '')) $outFile
}

Note:

  • When you use -Delimiter, the specified delimiter is included in each item passed through the pipeline (unlike the default behavior, where the default delimiter, a newline, is stripped); see the small demo after these notes.

  • Add-Content automatically adds a trailing CRLF to its output (in PSv5+, this could be suppressed with -NoNewLine).

  • The approach uses the [string] type's .Replace() method rather than PowerShell's flexible, regex-based -replace operator, because .Replace() performs literal replacements, which are faster (the equivalent command is
    Add-Content -Value (($_ -replace '\r\n', ' ') -replace '#@#@#') $outFile.
    That said, the speed gain is negligible; it's the file I/O part that takes up most of the time).
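
To illustrate the first point, here's a minimal sketch (the file name demo.txt is just a placeholder) showing that -Delimiter keeps the delimiter attached to each emitted item:

# Create a tiny sample file; Set-Content appends a trailing CRLF.
Set-Content demo.txt 'a#@#@#b#@#@#c'
Get-Content -Delimiter '#@#@#' demo.txt | ForEach-Object { '[{0}]' -f $_ }
# [a#@#@#]
# [b#@#@#]
# [c
# ]   <- the last item has no delimiter; it ends with the file's trailing CRLF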


Much faster PowerShell solution with on-demand compilation of C# code

dbenham's clever and elegant batch + JScript solution is significantly faster than the above PowerShell solution.

Here's an adaptation of his approach to using C# code inside a PowerShell script that is compiled on demand.

Compilation is surprisingly quick (on the order of 0.3 seconds on my late-2012 iMac), and the use of compiled code to process the file results in a significant performance gain.
Also note that compilation is only performed once per session, so subsequent invocations do not pay this penalty.

Processing a ~1 GB file (created by repetition of the contents of the above sample file) with the script printed below yields the following:

Compiling...
Processing file...
Completed:
  Compilation time:      00:00:00.2343647
  File-processing time:  00:00:26.0714467
  Total:                 00:00:26.3278546

Execution times in real-world applications will differ based on many factors, but based on @dbenham's timings mentioned in the comments below, the on-demand compilation solution is about twice as fast as the batch + JScript solution.


Source code of the fast PowerShell solution:

# Determine the input and output files.
$inFile = 'file'
$outFile = 'out'

# Get current time stamp for measuring duration.
$dtStart = [datetimeoffset]::UtcNow

# How many characters to read at a time.
# !! Make sure that this is at least as large as the maximum input line length.
$kCHUNK_SIZE = 1000000 

Write-Host 'Compiling...'

# Note: This statement performs on-demand compilation, but only 
#       on *first* invocation in a given session.
$tsCompilation = Measure-Command {

    Add-Type @"
  using System;
  using System.IO;

  namespace net.same2u.so
  {
    public static class Helper
    {

      public static void TransformFile(string inFile, string outFile, string sep)
      {
        char[] bufChars = new char[$kCHUNK_SIZE];
        using (var sw = new StreamWriter(outFile))
        using (var sr = new StreamReader(inFile))
        {
          int pos = 0; bool eof = false;
          string bufStr, rest = String.Empty;
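          // 'rest' carries the tail of the previous chunk: the partial
          // record that follows the last separator seen so far.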
          while (!(eof = sr.EndOfStream) || rest.Length > 0)
          {
            if (eof)
            {
              bufStr = rest;
            }
            else
            {
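              // Read the next chunk; Read() may return fewer characters than
              // requested, so use the actual count when building the string.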
              int count = sr.Read(bufChars, 0, $kCHUNK_SIZE);
              bufStr = rest.Length > 0 ? rest + new string(bufChars, 0, count) : new string(bufChars, 0, count);
            }
            if (-1 == (pos = bufStr.LastIndexOf(sep))) // should only happen at the very end
            {
              sw.Write(bufStr);
              rest = String.Empty;
            }
            else
            {
              pos += sep.Length; rest = bufStr.Substring(pos);
              sw.Write(bufStr.Substring(0, pos).Replace(Environment.NewLine, " ").Replace(sep, Environment.NewLine));
            }
          }

        }
      }

    }

  } // class Helper

"@
    if (-not $?) { exit 1 }
}

Write-Host 'Processing file...'

# Make sure the .NET framework sees the same current dir. as PS.
[System.IO.Directory]::SetCurrentDirectory($PWD)

$tsFileProcessing = Measure-Command {
  [net.same2u.so.Helper]::TransformFile($inFile, $outFile, '#@#@#')
}

Write-Host @"
Completed:
  Compilation time:      $tsCompilation
  File-processing time:  $tsFileProcessing
  Total:                 $([datetimeoffset]::UtcNow - $dtStart) 
"@
mklement0
  • Add some parallel processing to that script and you should be able to cut the time down dramatically, while preserving memory :) -- upvoted for minimal impact on system resources. Still prone to leaks (as most large-file operations are), but overall a workable solution in my opinion (just needs some PP) – Kraang Prime Dec 21 '16 at 22:52
  • Thanks, @SamuelJackson. I assume you're kidding about the parallel processing (do tell me if you're not, and what you had in mind). What leaks could there be? – mklement0 Dec 21 '16 at 22:55
  • Scroll down a ways [on this site](http://cc.davelozinski.com/c-sharp/the-fastest-way-to-read-and-process-text-files) to see the parallel processing examples for reading in c# -- I believe PowerShell has some parallel processing capabilities (but I could be wrong). The memory leaks with manipulating large files come naturally with buffering data (which is what is necessary to 'look' for something to replace with something else. It would almost be faster to do byte replacement, so replace '#@#@#' with '\n ' (5 spaces) as this wouldn't require the initial pointer to move. – Kraang Prime Dec 21 '16 at 23:02
  • Ugh, only marginally better than my original 1 byte at a time batch/JScript solution: 39 min to process a 155 MB file vs 52 min. That still scales to 6.5 hrs for a 1.5 GB file - not very practical. Compare that to my [latest batch/JScript solution](http://stackoverflow.com/a/41270186/1012053) at 2 min 48 sec for a 1.8 GB file. I imagine you could translate my JScript algorithm into powershell and get a bit more performance. – dbenham Dec 22 '16 at 04:28
  • @dbenham: Thanks for running a benchmark. I knew it wouldn't be fast, but I also didn't expect it to be that slow. PowerShell is the champion of abstraction, which is great, but at the expense of performance, which _sometimes_ makes its use impractical. – mklement0 Dec 22 '16 at 22:55
  • @SamuelJackson: Thanks. AFAIK, PowerShell's parallel-processing capabilities are limited to workflows and cmdlets that support the `-AsJob` parameter, neither of which would help here. I'm still unclear on why buffering would _necessarily_ introduce memory leaks. – mklement0 Dec 22 '16 at 23:00
  • @dbenham: I've adapted your clever batch + JScript solution to a PowerShell + C# solution (giving you due credit). Please see my update. – mklement0 Dec 23 '16 at 17:31
  • Very nice :-) But I don't understand why my batch/JScript solution is so slow on your machine. Perhaps there is a major performance penalty running the WSH script host on your iMac. You see a performance gain of ~20x going from my code to yours. But when I run your compiled code against a 155 MB file, I get 8 sec, as opposed to 15 sec with my pure script code - a factor of only ~2. Still significant, but not earth-shattering. – dbenham Dec 24 '16 at 18:01
  • @dbenham: Interesting. Note that my sample lines are quite short, which probably favors the compiled code, so the results may be skewed (that was just happenstance - I simply replicated my short sample lines to create a large file). What do you get when you run my solution against the file you used in your answer? I ran both solutions "cold" (each after a boot) in a single-core VMware Fusion W10 VM. I've amended my conclusion in the answer so as not to suggest that a factor of 20 is typical. – mklement0 Dec 24 '16 at 19:17
  • I replicated your test file to ~400 MB, and your code took 17 sec, while mine took 38 sec. My original 1.8 GB file took 90 sec with your code, 168 sec with mine. – dbenham Dec 24 '16 at 20:12
  • @dbenham: Thanks for digging deeper. Your numbers make more sense, and while I have no explanation for mine, I don't think we need to dig deeper. I've amended the conclusion to say that the compiled solution is about twice as fast. – mklement0 Dec 24 '16 at 22:08