I have a question that should make most people go "WTF?", but I have it nonetheless.
I've got a bunch of data files from a vendor. They're in a custom flat-file format that claims to be CSV, except it's not comma separated and values are not quoted. So, not really CSV at all.
foo,bar,baz
alice,bob,chris
And so on, except much longer and less interesting. The problem is that some records have embedded newlines (!!!):
foo,bar
rab,baz
alice,bob,chris
That is supposed to be two records of three fields each. Normally I would just say "No, this is stupid." and walk away, but I inadvisedly looked closer and discovered that the embedded newline is actually a different sequence from the real line-ending sequence:
foo,bar\n
rab,baz\r\n
alice,bob,chris\r\n
Note the bare \n at the end of the first line. I've determined that this holds for all the cases of embedded newlines I found: the embedded ones are a lone \n, while the real line endings are \r\n. So I basically need to do s/\n$// on the lines that don't end in \r (I tried that exact sed expression; it did nothing).
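To show concretely what "did nothing" means, here is the sed attempt on the sample data. As far as I can tell, sed strips the trailing newline from each line before running the script, so \n$ never has anything to match:

```shell
# sed removes each record's trailing newline before applying the script,
# so there is never a \n at the end of the pattern space for \n$ to match.
printf 'foo,bar\nrab,baz\r\nalice,bob,chris\r\n' | sed 's/\n$//'
# output is byte-for-byte identical to the input
```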
Note: I don't actually care about the contents of the fields, so replacing a newline with nothing is fine. I just need each line in the file to have the same number of fields (ideally, with each field in the same position).
I have an existing solution in the tool I wrote to process the files:
Guid g = Guid.NewGuid();
// Latin1 maps every byte to exactly one char, so nothing gets mangled on the round trip
string data = File.ReadAllText(file, Encoding.GetEncoding("Latin1"));
data = data.Replace("\r\n", g.ToString()); // swap real line endings for a unique placeholder
data = data.Replace("\n", "");             // remove the embedded newlines
data = data.Replace(g.ToString(), "\r\n"); // restore the real line endings
However, this fails on files that are bigger than a gigabyte or so, since ReadAllText loads the entire file into a single string. (Also, I haven't profiled it, but I suspect it's dog slow as well.)
The tools I have at my disposal are:
- cygwin tools (sed, grep, etc)
- .NET
What is the best way to do this?