
I have a .csv file that contains double-quoted multi-line fields. I need to convert each multi-line cell to a single line. It doesn't show in the sample data, but I do not know which fields might be multi-line, so any solution will need to check every field. I do know how many columns I'll have. The first line will also need to be skipped. I don't know how much data there will be, so performance isn't a consideration.

I need something that I can run from a bash script on Linux. Preferably using tools such as awk or sed and not actual programming languages.

The data will be processed further with Logstash but it doesn't handle double quoted multi-line fields hence the need to do some pre-processing.

I tried something like this and it kind of works on one row but fails on multiple rows.

sed -e :0 -e '/,.*,.*,.*,.*,/b' -e N -e '1n;N;N;N;s/\n/ /g' -e b0 file.csv

CSV example

First name,Last name,Address,ZIP
John,Doe,"Country

City
Street",12345

The output I want is

First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
Jane,Doe,Country City Street,67890
etc.
etc.
sjaak
    I don't think this is possible without a proper CSV parser. You need to count quotes, ignoring escaped ones, and remove a newline if it's after an odd number of quotes. That's not a regular expression kind of task. – joanis Aug 06 '19 at 01:28
  • 2
    Here are two different but related questions about parsing CSVs with awk - note that your case is the hardest one discussed in these questions: https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk https://stackoverflow.com/questions/4205431/parse-a-csv-using-awk-and-ignoring-commas-inside-a-field If you can save an awk script where you need to do this, I think the first link might especially help you - search for "replace newlines with blanks" to find the relevant part. – joanis Aug 06 '19 at 01:34
  • Thanks for the links. I tried some of the samples but they don't seem to do anything. Given that this seems to be a rather complex problem I think it is probably best to tell the person that provides the .csv to change the format. Unfortunately I can't spend a lot of time on this. I was hoping something more "plug & play" might exist. – sjaak Aug 06 '19 at 04:16
  • Look at [“grep” a csv file including multi-lines fields?](https://stackoverflow.com/questions/43912227/grep-a-csv-file-including-multi-lines-fields/43914514?r=SearchResults&s=11|24.3678#43914514) for suggestions using `awk` for that purpose. – David C. Rankin Aug 06 '19 at 04:51

5 Answers


First, my apologies for getting here 7 months late...

I came across a problem similar to yours today, with multiple multi-line fields. I was glad to find your question, but in my case there was the added complexity that, with more than one field affected, quotes might open, close and open again on the same line. Anyway, after reading a lot and combining answers from different posts, I came up with something like this:

First I count the quotes in a line. To do that, I strip everything but quotes and then use wc:

quotes=$(echo "$line" | tr -cd '"' | wc -c) # Counts the quotes

If you think of a single multi-line field, knowing whether there are 1 or 2 quotes is enough. In a more generic scenario like mine, I have to know whether the number of quotes is odd or even to tell whether the line completes the record or expects more information.

To check for even or odd you can use the modulo operator (%); in general:

even % 2 = 0
odd % 2 = 1
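The parity check above can be sketched in a couple of lines; the line content here is an illustrative example, not taken from any real file:

```shell
# Minimal sketch of the quote-parity check (the line is invented)
line='John,Doe,"Country'
quotes=$(printf '%s' "$line" | tr -cd '"' | wc -c)  # count the quotes
echo $((quotes % 2))   # prints 1: odd, so the record is still open
```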

For the first line:

  • Odd means that the line expects more information on the next line.
  • Even means the line is complete.

For the subsequent lines, I have to know the status of the previous one. For instance, in your sample text:

First name,Last name,Address,ZIP
John,Doe,"Country

City
Street",12345

You can say line 1 (John,Doe,"Country) has 1 quote (odd), which means the status of the record is incomplete, or open.

When you go to line 2, there is no quote (even). Nevertheless, this does not mean the record is complete; you have to consider the previous status. So for the lines following the first one it will be:

  • Odd means that record status toggles (incomplete to complete).
  • Even means that record status remains as the previous line.

What I did was looping line by line while carrying the status of the last line to the next one:

incomplete=0
while IFS= read -r line; do
    quotes=$(echo "$line" | tr -cd '"' | wc -c)  # Count the quotes
    incomplete=$(((quotes + incomplete) % 2))    # Odd or even decides the status
    if [ $incomplete -eq 1 ]; then
        echo -n "$line " >> new.csv              # Incomplete: join with the next line
    else
        echo "$line" >> new.csv                  # Complete: finish the record
    fi
done < file.csv

Once this is executed on a file in your format, it generates a new.csv like this:

First name,Last name,Address,ZIP
John,Doe,"Country  City Street",12345

I like one-liners as much as anyone. I wrote the script above for the sake of clarity, but you can, arguably, write it in one line:

i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv

I would appreciate it if you could go back to your example and see if this works for your case (which you most likely already solved). Hopefully this can still help someone else down the road...

Recovering the multi-line fields

Every need is different. In my case I wanted the records on one line to further process the csv and add some bash-extracted data, but I also wanted to keep the csv as it was. To accomplish that, instead of joining the lines with a space I used a marker (chosen to be unlikely to occur in the data) that I could later search for and replace:

i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l ~newline~ " || echo "$l";done >new.csv

The marker is ~newline~; this is totally arbitrary, of course.

Then, after doing my processing, I took the csv text file and replaced the coded newlines with real newlines:

sed -i 's/ ~newline~ /\n/g' new.csv
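Putting both steps together on a tiny invented file (the names file.csv and new.csv match the answer; the two-line record itself is made up):

```shell
# Round trip on a made-up two-line record
printf 'a,"x\ny",1\n' > file.csv

i=0
while IFS= read -r l; do
    i=$((($(printf '%s' "$l" | tr -cd '"' | wc -c) + i) % 2))
    if [ "$i" -eq 1 ]; then
        printf '%s ~newline~ ' "$l"   # open record: join and mark the break
    else
        printf '%s\n' "$l"            # closed record: emit as-is
    fi
done < file.csv > new.csv
# new.csv now holds: a,"x ~newline~ y",1

sed -i 's/ ~newline~ /\n/g' new.csv   # GNU sed: restore the original newlines
```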


TL;DR

Run this:

i=0;cat file.csv|while read l;do i=$((($(echo $l|tr -cd '"'|wc -c)+$i)%2));[[ $i = 1 ]] && echo -n "$l " || echo "$l";done >new.csv

... and collect results in new.csv

I hope it helps!

Jorge Valentini

If Perl is an option, please try the following:

perl -e '
# Slurp the whole file so quoted newlines stay inside $str
while (<>) {
    $str .= $_;
}

# Match either a double-quoted field (with "" escapes) or a bare field
while ($str =~ /("(("")|[^"])*")|((^|(?<=,))[^,]*((?=,)|$))/g) {
    if (($el = $&) =~ /^".*"$/s) {
        $el =~ s/^"//s; $el =~ s/"$//s;   # strip the surrounding quotes
        $el =~ s/""/"/g;                  # unescape doubled quotes
        $el =~ s/\s+(?!$)/ /g;            # squeeze inner whitespace runs to one space
    }
    push(@ary, $el);
}

# Re-join: a field ending in a newline closes the record, otherwise add a comma
foreach (@ary) {
    print /\n$/ ? "$_" : "$_,";
}' sample.csv

sample.csv:

First name,Last name,Address,ZIP
John,Doe,"Country

City
Street",12345
John,Doe,"Country

City
Street",67890

Result:

First name,Last name,Address,ZIP
John,Doe,Country City Street,12345
John,Doe,Country City Street,67890
tshiono
  • Thanks for the script. Unfortunately it doesn't work 100% on the actual data. I will tell the person I got the data from to change the formatting if they want it processed further. I tried to get rid of the offending column by using cut, but that doesn't work either because of the multi-lines. – sjaak Aug 06 '19 at 06:47

This might work for you (GNU sed):

sed ':a;s/[^,]\+/&/4;tb;N;ba;:b;s/\n\+/ /g;s/"//g' file

Test each line to see that it contains the correct number of fields (in the example, 4). If there are not enough fields, append the next line and repeat the test. Otherwise, replace the newline(s) with spaces and finally remove the "s.

N.B. This may be fraught with problems such as ,'s between "'s and quoted "'s.
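One hedged way around the field-counting caveat is to count quotes instead of commas, buffering lines in awk until the number of quotes seen is even; this sketch is untested against the OP's real data and still does not handle escaped "" quotes:

```shell
# Buffer lines until the quote count is even, then strip quotes and print
awk '{
    buf = (buf == "" ? $0 : buf " " $0)   # join continuation lines with a space
    q = gsub(/"/, "&", buf)               # count quotes without changing buf
    if (q % 2 == 0) {                     # even: the record is complete
        gsub(/"/, "", buf)                # drop the quotes
        print buf
        buf = ""
    }
}' file.csv
```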

potong

Try cat -v file.csv. When the file was made with Excel, you might be in luck: when the newlines inside a field are a plain \n and the newline at the end of a record is \r\n (which shows up as ^M), parsing is simple.

# Delete all newlines and replace each ^M with a real newline
tr -d "\n" < file.csv | tr "\r" "\n"

# The two steps in one command (this form joins with a space instead)
tr "\n\r" " \n" < file.csv

When you want a space between the joined lines, use the second form and add a step to strip the leading space it leaves on each record:

tr "\n\r" " \n" < file.csv | sed '2,$ s/^ //'
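To see the Excel case end-to-end, here is a throwaway file built with printf (the data is invented; the quoted newline is \n, the record ends are \r\n):

```shell
# Invented Excel-style sample: in-field \n, record-ending \r\n
printf 'a,"x\ny",1\r\nb,c,2\r\n' > crlf.csv

tr "\n\r" " \n" < crlf.csv | sed '2,$ s/^ //'
# a,"x y",1
# b,c,2
```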

EDIT: @sjaak commented that this didn't work in his case.

When your broken lines also have ^M, you may still be in luck.
When the broken field is always the first field in double quotes and you have GNU sed 4.2.2, you can join two lines whenever the first line has exactly one double quote.

 sed -rz ':a;s/(\n|^)([^"]*)"([^"]*)\n/\1\2"\3 /;ta' file.csv

Explanation:
-z don't treat \n as the line separator
:a label for repeating the step after a successful replacement
(\n|^) match a newline or the very start of the input
([^"]*) a substring containing no "
ta jump back to label a and repeat while substitutions keep succeeding

Walter A
  • Running cat -v does show ^M. The problem is that is also shows ^M at the end of each row so when I run those commands I end up with a single line but instead I need to convert only the multi-line cells into a single line. – sjaak Aug 06 '19 at 08:37
  • @sjaak When the second try also fails, you should look for another solution (other answers or links given in comments). When this works, you might want to do something with the empty lines, which will be converted into 2 spaces (delete them first with `grep . | ` or replace at the end with `s/  / /g`). – Walter A Aug 06 '19 at 10:09

awk pattern matching works here. The answer in one line:

  awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile

If you'd like to drop the quotes, you could use:

  awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' YourFile | sed 's/"//gw NewFile'

but I prefer to keep them.

To explain the code:

  1. /Pattern/ : find the pattern in the current line.

  2. ORS : the output record separator.

  3. $0 : the whole of the current line.

  4. 's/OldPattern/NewPattern/' : substitute the first OldPattern with NewPattern.

  5. /g : applies the substitution to every OldPattern.

  6. /w : writes the result to NewFile.
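The ORS toggle can be checked on a throwaway sample (the data is invented; note the approach relies on ," and ", appearing only at quote boundaries):

```shell
# Invented one-record sample with a quoted newline
printf 'a,"x\ny",1\n' > demo.csv

awk '/,"/{ORS=" "};/",/{ORS="\n"}{print $0}' demo.csv
# a,"x y",1
```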

Hossein Vatani