Need a way to strip extra CRLFs from the middle of records

Question

We've got a data source that sends us files we need to automatically import into our system, but frequently there is bad data in them in the form of extra CRLFs inside one or two of the fields of some records. The file is a CSV format file and originally I had thought I'd fixed the problem with this:

awk 'NR%2-1{gsub(/\r?\n/, FS)} NR>1{printf RS}1' RS=\" ORS= input.csv > output.csv

which worked great for the records that had the extra CRLFs inside of quotes, but it turns out we're also getting some records that have no quotes in them which throws the awk command off.

The last field in each record is a numerical field consisting of 4-7 digits, is there some way to use that as a reference to keep the CRLF that follows that field and remove the rest?

Alternatively, is there some way to strip CRLFs by counting the fields and removing them prior to the last field in the record?

Input looks like this:

SMITH,John,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,"Northern District of California, USA",Individual,,,12/09/2003,18/08/2015,78452
SMITH,Patty,,,,,,05/10/1974,,USA,USA,A 5551212,"Resides in Oregon, USA",,,,Portland,,,,,USA,"District of Oregon, USA",Individual,,,15/09/2002,02/02/2015,121567

The problem records look like below - note the carriage return and missing quotes:

SMITH,Bill,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,Northern District of
California, USA,Individual,,,12/09/2003,18/08/2015,78452

For output we need the record to be the same as the other two - all on one line:

SMITH,Bill,,,,,,05/10/1966,,USA,USA,A 5551212,"Resides in California, USA",,,,Mill Valley,,,,,USA,Northern District of California, USA,Individual,,,12/09/2003,18/08/2015,78452

Show sample valid and broken input with and without the quotes? And your desired output for those inputs. — Etan Reisner, Oct 15 '15 at 16:51
The `csv.awk` script mentioned in http://stackoverflow.com/q/4205431/258523 might be of use to you here. — Etan Reisner, Oct 15 '15 at 17:40
Fix the source. The problem of whether the data is really extra is undecidable. — user207421, Oct 15 '15 at 18:38
@EJP: "The last field in each record is a numerical field consisting of 4-7 digits" makes it decidable" (with the reasonable assumption that there's no line break inside _that_ field). — mklement0, Oct 15 '15 at 19:03
@EJP: The problem is we don't control the source so fixing it sadly isn't an option for us. That would absolutely be the best solution. — barc0001, Oct 15 '15 at 21:13

mklement0 · Accepted Answer · 2015-10-15T18:59:15.303

1

If all you need is the removal of the field-internal CRLFs, try the following (assumes GNU awk, but it could be made to work with BSD awk as well):

awk -v RS='\r?\n' '/,[[:digit:]]{4,7}$/ { print; next } { printf("%s ", $0) }' input.csv > output.csv

/,[[:digit:]]{4,7}$/ matches only lines that end in 4-7 digits, implying that the line at hand is either a complete record or is a multi-line record's last line.
- { print; next } simply prints the line with a terminating \n (if you wanted \r\n on output too, you'd have to useprintf("%s\r\n", $0) instead).
{ printf("%s ", $0) } is then only printed for record fragments, i.e., a record that has a field-internal CRLF and therefore continues on the next line; by printing it with printf and just a trailing space, the net effect is that multiple lines comprising a single record are effectively joined with a space each on output.

edited Oct 15 '15 at 18:59

answered Oct 15 '15 at 18:51

mklement0

382,024
64
607
775

Using GNU awk 3.1.7. the initial code strips all the CRLFs but leaves everything as one large line. Changing { print; next } to { printf("%s\r\n", $0) } doesn't seen to have any effect on output either. Did I misinterpret? – barc0001 Oct 15 '15 at 20:56
@barc0001: v3.1.7 is quite old - current is 4.1.3; can you [upgrade](http://ftp.gnu.org/gnu/gawk/)? Aside from that: is the input all-CRLF, or a mix of CRLF and LF? Do you want the output to be all-CRLF or all-LF? – mklement0 Oct 15 '15 at 21:33
Upgrading can be done yes. Didn't realize it was that outdated. The input is all CRLF yes, and output of CRLF is what's expected on the next step of the process. – barc0001 Oct 15 '15 at 21:45
1

Using the new version of gawk functions as expected, bad news is that a test run against one of our input files shows there are a couple of records that have a 4 or 5 digit number in front of the mid-record CRLF which means this approach still won't fix everything unfortunately. Thanks for the help mklement0, your code works perfectly based on what I asked but it seems I didn't have a full scope of the problem before I asked for help. Will have to try some sort of fixed field record processing instead. – barc0001 Oct 15 '15 at 22:12
@barc0001: My pleasure, and thanks for accepting. You could try to combine the command with _counting fields_ to catch these cases too, but note that in order to count fields accurately - to account for quoted fields - you'll probably have to use GNU Awk's `FPAT` variable. – mklement0 Oct 15 '15 at 22:33

Need a way to strip extra CRLFs from the middle of records

1 Answers1