4

I have a question that should make most people go "WTF?", but I have it nonetheless.

I've got a bunch of data files from a vendor. It's in a custom flat-file format that claims to be CSV, except it's not comma separated, and values are not quoted. So, not really CSV at all.

foo,bar,baz
alice,bob,chris

And so on, except much longer and less interesting. The problem is, some records have embedded newlines (!!!):

foo,bar
rab,baz
alice,bob,chris

That is supposed to be two records of three fields each. Normally, I would just say "No, this is stupid.", but I inadvisedly looked closer, and discovered that the embedded newlines actually use a different end-of-line sequence than the real record terminator:

foo,bar\n
rab,baz\r\n
alice,bob,chris\r\n

Note the bare \n at the end of the first line. I've determined that this holds for all the cases of embedded newlines I found. So, I basically need to do s/\n$// (I tried that exact command; it did nothing, because sed strips the trailing newline from each line before matching, so `\n$` can never match anything).
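For the record, `cat -A` (in the cygwin coreutils) makes the difference visible: it prints `\r` as `^M` and marks every true end of line with `$`:

```shell
# cat -A shows CR as ^M and marks each real line end with $
printf 'foo,bar\nrab,baz\r\nalice,bob,chris\r\n' | cat -A
# foo,bar$
# rab,baz^M$
# alice,bob,chris^M$
```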

Note: I don't actually care about the contents of the fields, so replacing a newline with nothing is fine. I just need each line in the file to have the same number of fields (ideally, in the same positions).
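The obvious whole-file sed (a sketch, assuming GNU sed) does work on small inputs: read the entire file into the pattern space, then delete every \n not preceded by \r. But it slurps the whole file into memory, so it hits the same wall on multi-gigabyte files:

```shell
# GNU sed: slurp the whole file, then delete each \n whose preceding
# byte is not \r (a literal CR is spliced into the bracket expression)
cr=$(printf '\r')
printf 'foo,bar\nrab,baz\r\nalice,bob,chris\r\n' |
  sed -e ':a' -e 'N' -e '$!ba' -e "s/\([^$cr]\)\n/\1/g"
# prints the two records, with their \r\n endings intact
```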

I have an existing solution in the tool I wrote to process the files:

Guid g = Guid.NewGuid();

string data = File.ReadAllText(file, Encoding.GetEncoding("Latin1"));
data = data.Replace("\r\n", g.ToString()); //just so I have a unique placeholder
data = data.Replace("\n", "");
data = data.Replace(g.ToString(), "\r\n");

However, this fails on files that are bigger than a gigabyte or so: a 1 GB Latin-1 file becomes a 2 GB UTF-16 string in memory, right up against the CLR's 2 GB object-size limit, and every Replace allocates yet another full copy. (Also, I haven't profiled it, but I suspect it's dog slow as well.)

The tools I have at my disposal are:

  • cygwin tools (sed, grep, etc)
  • .NET

What is the best way to do this?

Mike Caron

4 Answers

5

Instead of reading the entire thing into memory as one big (potentially huge) string, consider a stream-based approach.

Open the input stream and read a character at a time, making your replacements as needed. Open an output stream and write the surviving characters into it. Something like:

static void Main( string[] args )
{
    using( var inFs = File.OpenRead( @"C:\input.txt" ) )
    using( var reader = new StreamReader( inFs ) )
    using( var outFs = File.Create( @"C:\output.txt" ) )
    using( var writer = new StreamWriter( outFs ) )
    {
        int cur;
        char last = '\0';
        while( ( cur = reader.Read() ) != -1 )
        {
            char c = (char)cur;
            // Copy every character except a '\n' that is not part of "\r\n"
            if( c != '\n' || last == '\r' )
                writer.Write( c );

            last = c;
        }
    }
}
Ed S.
  • This! You are going to run into memory/speed problems otherwise with massive files like this. – naspinski Oct 30 '12 at 19:06
  • 1
    He wants to keep `\r\n` and drop only `\r` when it's by itself. – Jon B Oct 30 '12 at 19:07
  • @JonB: Oh, yep, thanks, I misread his example. Regardless, the method is the same. I will whip up some sample code. – Ed S. Oct 30 '12 at 19:08
  • This won't work. `TextReader.ReadLine` will read up to any end of line sequence, including `\n` or `\r` by themselves. It also won't return the EOL sequence, so this will just strip out all newlines. – Mike Caron Oct 30 '12 at 19:39
  • Also, there is a typo in the question title. It should be `\n`, not `\r` – Mike Caron Oct 30 '12 at 19:40
  • @MikeCaron: D'oh, and that's what I get for changing it at the last moment (I was reading character by character in the first example). I'll fix it, thanks – Ed S. Oct 30 '12 at 19:40
  • @MikeCaron: That part I got; he wants to remove '\n'. The code is (trying to) remove all \r's and then replacing the remaining \n's with \r\n. – Ed S. Oct 30 '12 at 19:41
  • @EdS.: I'm he :) And, that won't work, because that will turn all the `\r\n` sequences into `\n`, which is identical to the sequences I want to remove. Doing it the other way (replace `\n` with "" and turn `\r` into `\r\n`) would work. If you could show me an easy way to do that, I would be grateful – Mike Caron Oct 30 '12 at 19:43
  • @MikeCaron: Ahh, so you are :). You need to do this character by character anyway, so it's a matter of replacing \n when the previous character was not \r. The stream approach will at least fix your program, the rest is just details. – Ed S. Oct 30 '12 at 19:46
  • @MikeCaron: Ok, meeting was delayed by 5 so I had time to fix it. – Ed S. Oct 30 '12 at 19:54
  • That looks good! I ended up doing it slightly differently (see my answer below), but I'll accept yours since your solution is actually slightly more accurate to the question than mine :) – Mike Caron Oct 30 '12 at 19:56
  • @MikeCaron: Ok, thanks. I wanted to make sure I got it right; remove all \n's unless they are grouped as \r\n, right? :D – Ed S. Oct 30 '12 at 19:58
3

That's an awful lot of code to do something so simple.

Try this instead. The first tr deletes every `\n` (including the one in each `\r\n`); the second turns the leftover lone `\r` bytes back into newlines. Both stream the data, so file size doesn't matter:

tr -d '\n' <dirtyfile | tr '\r' '\n' >cleanfile
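Running the sample from the question through it, with a second `tr` to turn the surviving `\r` bytes back into newlines (so the output ends up with Unix line endings):

```shell
# drop every \n (embedded and the one in each \r\n), then map \r to \n
printf 'foo,bar\nrab,baz\r\nalice,bob,chris\r\n' |
  tr -d '\n' | tr '\r' '\n'
# foo,barrab,baz
# alice,bob,chris
```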
Ben Hardy
0

Here's a StreamReader class that seems to do what I want. Note that this is probably incredibly domain specific, so it may or may not be useful:

class BadEOLStreamReader : StreamReader {
    private int pushback = -1;

    public BadEOLStreamReader(string file, Encoding encoding) : base(file, encoding) {

    }

    public override int Peek() {
        if (pushback != -1) {
            return pushback; //don't consume the pushed-back character on a peek
        }

        return base.Peek();
    }

    public override int Read() {
        if (pushback != -1) {
            //a '\n' already validated as the second half of a "\r\n"
            var r = pushback;
            pushback = -1;
            return r;
        }

        var ret = base.Read();
        while (true) {
            if (ret == '\r') {
                var ret2 = base.Read();
                if (ret2 == '\n') {
                    //it's a good "\r\n": return the '\r' now, the '\n' on the next call
                    pushback = ret2;
                    return ret;
                }
                //lone '\r': drop it and reconsider the character after it
                ret = ret2;
            } else if (ret == '\n') {
                //lone '\n': the bogus embedded newline, skip it
                ret = base.Read();
            } else {
                return ret;
            }
        }
    }
}
Mike Caron
0

EDIT: after some tests, the awk solution gives better results in terms of speed.

The standard filters in UNIX/Linux/Cygwin have a hard time dealing with binary files. To do it with filters, you need to convert your file to hex, edit it with sed (or awk, see the 2nd solution below), and convert it back to its original form. This should do it:

xxd -c1 -p file.txt | 
  sed -n -e '1{h}' -e '${x;G;p;d}' \
      -e '2,${x;G;/^0d\n0a$/{P;b};/\n0a$/{P;s/.*//;x;b};P}' |
  xxd -r -p

OK, this is not simple to understand, so let's begin with the easy parts:

  • xxd -c1 -p file.txt converts file.txt from binary to hex, one byte per line.
  • xxd -r -p reverses the conversion.
  • The sed part replaces a \n (0a in hex) that is not preceded by a \r (0d in hex) with nothing.

The idea of the sed part is to keep the previous byte in the hold space, and to work on both the previous and the current byte:

  • On the 1st line, store the line (byte) in the hold space.
  • On the last line, print both bytes in the correct order (x;G;p) and stop the script (d).
  • For the lines in between, after putting the current byte in the hold space and the two bytes (previous and current) in the pattern space (x;G), there are three possible cases:
    1. If it is a \r\n, print the \r, keep the \n in the hold space for the next cycle, and end this cycle (b command).
    2. Else, if it ends with \n (meaning it was not preceded by \r), store an empty string in the hold space and end this cycle (b command).
    3. Else, print the first byte.

It might be simpler to understand in awk:

xxd -c1 -p file.txt |
  awk 'NR > 1 && $0 == "0a" && p != "0d" {$0 = ""}
       NR > 1 {print p}
       {p = $0}
       END{print p}' |
  xxd -r -p

It can be tested with:

printf "foo,bar\nrab,baz\r\nalice,bob,chris\r\n" |
  xxd -c1 -p | 
  sed -n -e '1{h}' -e '${x;G;p;d}' \
      -e '2,${x;G;/^0d\n0a$/{P;b};/\n0a$/{P;s/.*//;x;b};P}' |
  xxd -r -p

or

printf "foo,bar\nrab,baz\r\nalice,bob,chris\r\n" |
  xxd -c1 -p |
  awk 'NR > 1 && $0 == "0a" && p != "0d" {$0 = ""}
       NR > 1 {print p}
       {p = $0}
       END{print p}' |
  xxd -r -p
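If your awk is GNU awk, the hex round-trip can be skipped entirely: multi-character RS is a gawk extension, so you can make \r\n the record separator and strip any stray \n inside each record (a sketch, only tested on the sample above):

```shell
# gawk only: split records on \r\n, delete bare \n inside each record
printf 'foo,bar\nrab,baz\r\nalice,bob,chris\r\n' |
  awk 'BEGIN { RS = ORS = "\r\n" } { gsub(/\n/, ""); print }'
# foo,barrab,baz   (still \r\n-terminated)
# alice,bob,chris
```

This streams record by record, so it should also behave on multi-gigabyte files.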
jfg956
  • An interesting (and thoroughly incomprehensible :) solution, but does it scale? E.g., if I need to process a 2 GB data file, should I grab lunch? – Mike Caron Oct 31 '12 at 03:02
  • Incomprehensible: true, the sed part is not the easiest to understand. Did you get the usage of xxd, and the idea of removing the orphan `\n`s? The awk solution might be easier to understand, but less efficient. – jfg956 Oct 31 '12 at 06:52
  • Does it scale: yes, it can deal with any size of input. Is it efficient: it depends on your point of view. The 1st xxd converts a 2 GB binary file into a 6 GB text file with one byte per line, which is a lot of data. Then each byte is read one at a time to decide whether it needs to be kept or not. sed and awk are probably not the best tools to do that quickly, but if you are not a programmer or do not have access to a compiler, they may be the only solution you have. – jfg956 Oct 31 '12 at 06:57
  • If you want a fast solution, a well-optimized program in C or Java (or Perl or Python, but I do not know those last two) will solve the problem quickly, but if it is a one-time file processing job, waiting a little might be the less painful solution. – jfg956 Oct 31 '12 at 07:00
  • A last thought: this solution breaks with multi-byte character sets (Unicode). – jfg956 Oct 31 '12 at 07:01
  • About efficiency: the awk solution takes 6 seconds on my laptop for a 3.3 MB file, and I stopped the sed solution after 60 seconds, so the awk solution is better. On a 13 MB file, the awk solution takes a little less than 24 seconds; on a 52 MB file, 93 seconds. So I guess it would take about an hour on your 2 GB file. – jfg956 Oct 31 '12 at 07:16
  • I am not a sed/awk/etc. guru, so anything fancier than `s/x/y/g` is incomprehensible :) Anyway, I ended up going with a solution that removes the newlines as the file is streamed, which seems to be pretty efficient, but thanks anyway! – Mike Caron Oct 31 '12 at 10:50