Have a large csv file with some line feeds in quoted columns. I need to run a shell script, sed, awk, perl are fine and replace only the line feeds inside quotes with a space. The end of line must be preserved and I don't know the number of columns or which fields might have these embedded line feeds.
Further examination of the file shows this as a result of cat -v test_lf.csv
"NORTH ?M-^@?OLMSTED"
"PORT?M-^@?ST?M-^@?LUCIE"
in the csv file which in excel shows a line break where the ?M-^@? is.
I'd like to use tr to replace whatever that is with a space. How would I do that? What is that sequence?
I've now found that a small part of the file looks like the following in characters and in hex.
13:20:59 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -x
0000000 3431 3136 3439 3039 2c39 542c 4c45 3543
0000020 5f36 3430 2c47 4c46 332c 3934 3338 312c
0000040 3634 3931 3934 3930 222c 4f50 5452 80e2
0000060 53a8 e254 a880 554c 4943 2245 462c 2c4c
0000100 3433 3839 2c33 3737 2c32 3237 2c37 3535
0000120 2c2c 5441 334c 2c51 6e43 7463 222c 6f4e
0000140 80e2 4da8 6773 2c22 2c34 3832 312c 0d35
0000160 000a
0000161
13:21:50 [woo:~/Development/scripts] > tail -n 8 test_lf.csv | head -n 1 | od -c
0000000 1 4 6 1 9 4 9 0 9 , , T E L C 5
0000020 6 _ 0 4 G , F L , 3 4 9 8 3 , 1
0000040 4 6 1 9 4 9 0 9 , " P O R T
**
0000060 ** S T
** ** L U C I E " , F L ,
0000100 3 4 9 8 3 , 7 7 2 , 7 2 7 , 5 5
0000120 , , A T L 3 Q , C n c t , " N o
0000140
** ** M s g " , 4 , 2 8 , 1 5 \r
0000160 \n
I'm on a mac and the file is utf-8, and my locale is utf-8. It appears that the bytes are swapped (little-endian), so hexadecimal 3431 3136 is 1463 in the character representation. So, starting at byte 60 in this output, we have:
something, S, T, blank, something, something, L and in hex this is:
53a8 e254 a880 554c, 53 is an S, 54 is a T, and 4c is an L . So, there is an e2, a8, 80 sequence between the T and the L. This produces a line-feed in the Excel spreadsheet field.
How do I find and replace these bytes by a space?