sed: replace whitespace characters with single comma except inside quotes

Question

This line is from a car dataset (https://archive.ics.uci.edu/ml/datasets/Auto+MPG) looking like this:

15.0   8.   429.0      198.0      4341.      10.0   70.  1.     "ford galaxie 500"

How would one replace each run of whitespace (the file has both spaces and tabs) with a single comma, but not inside the quotes, preferably using sed, to turn the dataset into a real CSV? Thanks!

Maybe this will help: http://stackoverflow.com/questions/14916159/sed-replace-spaces-within-quotes-with-underscores — John Zwinck, Jan 21 '15 at 07:30
I tried:
$ sed 's/[^"] [^"]//g' data/auto-mpg.data-original
$ sed 's/[^"][ \t][^"]/,/g' data/auto-mpg.data-original
$ sed 's/[^"][ \t]*[^"]/,/g' data/auto-mpg.data-original
$ sed 's/[^"][ \t][^"]/,/g' data/auto-mpg.data-original
$ sed 's/[ \t]/,/g;s/,,,//g' data/auto-mpg.data
$ sed 's/[ \t]/,/g' data/auto-mpg.data
$ perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge' data/auto-mpg.data
$ sed 's/\(.*"\),/\1 /' data/auto-mpg.data
$ sed 's/\(.*\"\),/\1 /g' data/auto-mpg.data-commad
– importError, Jan 21 '15 at 08:24

Wintermute · Accepted Answer · 2015-01-21T10:12:26.703

6

Do it with awk:

awk -F'"' 'BEGIN { OFS="\"" } { for(i = 1; i <= NF; i += 2) { gsub(/[ \t]+/, ",", $i); } print }' filename.csv

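With " as the field separator, the odd-numbered fields (1, 3, 5, ...) are the parts of the line outside quotes, which is where runs of whitespace should be replaced; the even-numbered fields are the quoted strings and are left untouched. Spelled out with comments: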

BEGIN { OFS = FS }               # output should also be separated by "
{
  for(i = 1; i <= NF; i += 2) {  # in every second field
    gsub(/[ \t]+/, ",", $i)      # replace spaces with commas
  }
  print                          # and print the whole shebang
}
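
As a quick check, the script can be run on the sample line from the question. This is only a sketch: the variable names `line` and `out` are made up for illustration, and the sample is assumed to have a tab before the quoted name and spaces elsewhere.

```shell
# Sample record from the question; printf turns \t into a literal tab.
line=$(printf '15.0   8.   429.0      198.0      4341.      10.0   70.  1.\t"ford galaxie 500"')

# Split on double quotes; odd-numbered fields lie outside quotes,
# so only those get their whitespace runs replaced by a comma.
out=$(printf '%s\n' "$line" |
  awk -F'"' 'BEGIN { OFS = FS } { for (i = 1; i <= NF; i += 2) gsub(/[ \t]+/, ",", $i); print }')

printf '%s\n' "$out"
```

On this input the output is 15.0,8.,429.0,198.0,4341.,10.0,70.,1.,"ford galaxie 500". A line that begins with whitespace would get a leading comma, since that whitespace run is replaced like any other.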

edited Jan 21 '15 at 10:12

answered Jan 21 '15 at 09:32

Wintermute


Thank you for your answer, I will let you know after getting the chance to try it out. – importError Jan 23 '15 at 04:22

potong · Answer 2 · 2015-01-21T12:15:27.253

0

This might work for you (GNU sed):

sed 's/\("[^"]*"\|[0-9.]*\)\s\s*/\1,/g' file

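This takes a quoted string or a decimal number followed by whitespace and replaces that whitespace with a comma, throughout every line. Note that a quoted string is only protected when whitespace follows it: if it is the last thing on the line, the first alternative does not match there and the spaces inside the quotes are replaced as well.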

To be less specific use (as per comments):

sed -r 's/("[^"]*"|\S+)\s+/\1,/g' file
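
One way to also cover a quoted field that is not followed by whitespace (such as the last field on the sample line) is to let each token be followed by either whitespace or the end of the line, then strip the comma appended after the last field. This is a variant sketched for illustration, not the command from this answer; it assumes a sed that accepts `-E` (GNU and BSD sed do), and the variable names `line` and `out` are hypothetical.

```shell
# Sample record from the question; printf turns \t into a literal tab.
line=$(printf '15.0   8.   429.0      198.0      4341.      10.0   70.  1.\t"ford galaxie 500"')

# First expression: keep a quoted string or an unquoted token, and replace
# the whitespace run (or end of line) after it with a comma.
# Second expression: drop the comma appended after the last field.
out=$(printf '%s\n' "$line" |
  sed -E -e 's/("[^"]*"|[^"[:space:]]+)([[:space:]]+|$)/\1,/g' -e 's/,$//')

printf '%s\n' "$out"
```

On the sample line this prints 15.0,8.,429.0,198.0,4341.,10.0,70.,1.,"ford galaxie 500". Because the unquoted alternative excludes the double quote, a quoted string can only be consumed as a whole, so its inner spaces are never treated as separators.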

edited Jan 21 '15 at 12:15

answered Jan 21 '15 at 09:37

potong


That confused some of my test inputs, and I drew false conclusions at first (sorry about the first comment). There's a typo in your pattern: the closing paren should be escaped, and may I suggest replacing `[0-9.]` with `[^[:space:]]` to make it work with non-numeric unquoted tokens? That is: `s/\("[^"]*"\|[^[:space:]]*\)\s\s*/\1,/g` – Wintermute Jan 21 '15 at 10:11
Thank you for your answer, I will let you know after getting the chance to try it out. Thanks also to whoever pointed out that using sed instead of awk would be silly. – importError Jan 23 '15 at 04:22
