0

Currently I have a CSV file that contains address information, the problem I have is that the address field has embedded new line characters so when I try and load it into MySQL it doesn't work.

Here is an example of what my csv file looks like:

ID|NAME|ADDRESS|PHONE
1|"JOHN DOE"|"1 SHORT ROAD LONDON UNITED KINGDOM"|"01234 567 890"
2|"JANE DOE"|"1 SHORT ROAD LONDON 
UNITED KINGDOM"|"01234 567 890"
3|"BOB DOE"|"1
SHORT ROAD
LONDON
UNITED KINGDOM"|"01234 567 890"

And Here is what I want it to look like:

ID|NAME|ADDRESS|PHONE
1|"JOHN DOE"|"1 SHORT ROAD LONDON UNITED KINGDOM"|"01234 567 890"
2|"JANE DOE"|"1 SHORT ROAD LONDON UNITED KINGDOM"|"01234 567 890"
3|"BOB DOE"|"1 SHORT ROAD LONDON UNITED KINGDOM"|"01234 567 890"

My initial approach was to count the amount of '|' in the header line and have that as the target count.

typeset -i target
target=`head -1 broken.csv | awk -F\| '{print NF-1}'`

Then what I wanted to do was to iterate through the file line by line doing the same count on each line

while read -r line
do
count=`echo $line | awk -F\| '{print NF-1}'`

But I can't figure out the code for the next bit

What I want to do is

if count == target
echo $line >> fixed.csv
else
#append the next line and do the count again

I think I need a nested loop but I can't quite figure out the syntax so was hoping someone could give me some insight.

Any help much appreciated :)

  • 3
    First of all first CSV file is valid, so if you can't import it to mysql than your import should be fixed rather than the file. – pawel7318 Jan 06 '16 at 10:52

5 Answers5

0

One way is to remove newlines not followed by '<ID>|"' (or EOF), using a negative look-ahead assertion (supported by perl, python etc.):

perl -0pe 's/\n(?!(\d+\|"|\z))//g' file
Eugene Yarmash
  • 142,882
  • 41
  • 325
  • 378
0

Please check my comment to your question. However if you want to stick to your way of fixing it than this should do what you need:

cat not-really-broken.csv |sed -e ':a;N;$!ba;s/\([^"]\)\n/\1 /g' -e 's/ /\n/'

Main idea is to replace all \n with (space) but only if it's not after " and to skip the firs line.

Please note this can fail for some other CSV files as those can use different character for quoting or omit it for some fields.

References:

Have sed make substitute on string but SKIP first occurrence

How can I replace a newline (\n) using sed?

What regex will match every character except comma ',' or semi-colon ';'?

Community
  • 1
  • 1
pawel7318
  • 3,383
  • 2
  • 28
  • 44
0

You can remove all newlines from the file using the tr command. removing new line character from incoming stream using sed

cat address_leveling.dat|tr -d '\n' > address_leveling2.dat

Then use sed to search for the pattern '1 or more digits followed by | and "' and then send that to a third file.

sed -e 's/[0-9]\{1,\}|"/\'$'\n&/g' address_leveling2.dat

And manually remove the first empty line.

Community
  • 1
  • 1
kometen
  • 6,536
  • 6
  • 41
  • 51
  • If you remove all the newlines, how do you tell the records apart? – ghoti Jan 06 '16 at 16:59
  • I look for the pattern 'one or more digits ([0-9]\{1,\}) followed by the pipe symbol | and then the double quote " '. When I encounter this I prepend a new line. Rather than trying to catch three-four different ways a record can be spread I find it easer just to have one pattern to look for. – kometen Jan 06 '16 at 17:19
  • Ah, matching the pattern of the index, and assuming that no other fields will be numeric. I understand now. I wouldn't recommend this solution, as it's too specific to the OP's sample data to be of general use outside this question. – ghoti Jan 06 '16 at 17:28
  • Thank you for your feedback. Yes, it can be nice to have a more generic solution but sometimes I find it easier to break it into a few small steps. The unix pipe sort of thinking. :-) – kometen Jan 06 '16 at 17:36
0
#awk -F\| ...
{ fc += NF; s += sep $0; sep = " " }
fc == 4 { print s; fc = 0; s = ""; sep = "" }

Accumulate fields until reaching four. Print and reset. Change sep to leave a marker where the line breaks used to be if necessary.

shawnt00
  • 16,443
  • 3
  • 17
  • 22
0

Here's a solution that works on OSX:

sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g' -e 's/\([0-9][0-9]*|\)/\'$'\n\\1/g' testfile.csv

The strategy is to replace all newlines with a white space and then insert newlines right before pattern number| (number followed by pipe).

If you're not on OSX, the code will look much cleaner (but I can't test this at the moment):

sed ':a;N;$!ba;s/\n/ /g' -e 's/\([0-9]+|\)/\n\1/g' testfile.csv

Actually, this is pretty much the same strategy as pawel7318's solution. Except his won't work on OSX due to quirks (see OSX sed newlines - why conversion of whitespace to newlines works, but newlines are not converted to spaces).

Community
  • 1
  • 1
econ
  • 547
  • 7
  • 22