
I have an ASCII file (CSV, for instance) that uses `,` as separator; however, this character is also present in quoted strings:

3,       "hh,1,foo",            foo
"5,,,5", "1,2,3d,,,something ", foo2
test,    "col3",                foo3

To avoid such ambiguity I want to replace the `,` separator with `;`. How can I do that from the command line (under Linux)?


UPDATE/extra question:

There are at least two options, using sed or awk.

My extra question is: which one is the fastest? (This can be important for big CSV files.)

Picaud Vincent
  • So did you want to replace all `,` or only specific ones? Could you please mention more details here? – RavinderSingh13 Nov 14 '17 at 09:22
  • @RavinderSingh13 I want to replace the ',' that are _outside_ my quoted strings, the ones that have a field "separator" role. – Picaud Vincent Nov 14 '17 at 09:26
  • If you want to know which one is the fastest, you can do basic benchmarking with e.g. `time awk '...'` (remember to take caching into account, so run the test a few times until the result becomes stable). – Tom Fenech Nov 14 '17 at 09:54
  • @TomFenech yes, that is what I am currently doing... I will report the results I get – Picaud Vincent Nov 14 '17 at 09:56
  • @TomFenech I just updated with some bench results – Picaud Vincent Nov 14 '17 at 10:12
  • All of the answers you got will fail for more interesting but still standards-conformant CSVs, and almost all of them rely on non-standard tools and/or proprietary extensions to standard tools (e.g. GNU sed). If any of that matters to you, to do this job robustly, efficiently, and portably using just a standard UNIX tool available on all UNIX platforms see https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk. – Ed Morton Nov 14 '17 at 14:05
  • @EdMorton Thanks for your link, yes the procedure you describe is more robust (Windows line endings, etc.). Please also note that my initial question was not especially oriented towards csv files but towards the "separator in quoted string" problem in general. – Picaud Vincent Nov 14 '17 at 14:27
  • You're welcome. Your file format **is** [CSV](https://stackoverflow.com/tags/csv/info), even if you used some character other than comma as the separator. If your separator was a string instead of a single character **then** your format would not be CSV, but then you'd have to post a new question as I doubt any of the answers you got would work. – Ed Morton Nov 14 '17 at 14:39

4 Answers


I think your requirement is the perfect use case for `FPAT` in GNU Awk. Quoting the GNU Awk manual:

Normally, when using FS, gawk defines the fields as the parts of the record that occur in between each field separator. In other words, FS defines what a field is not, instead of what a field is. However, there are times when you really want to define the fields by what they are, and not by what they are not.

The most notorious such case is so-called comma-separated values (CSV) data. If commas only separated the data, there wouldn’t be an issue. The problem comes when one of the fields contains an embedded comma. In such cases, most programs embed the field in double quotes.

In the case of CSV data as presented here, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant (see Regexp), we would have ([^,]+)|([[:space:]]*\"[^\"]+\"). Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]+)|([[:space:]]*\"[^\"]+\")"

Using that on your file, to produce output delimited by the `;` character, do

awk -v OFS=';' 'BEGIN{FPAT = "([^,]+)|([[:space:]]*\"[^\"]+\")"}{$1=$1}1' file

Also, you could speed things up by playing with your system's locale settings. Forcing the locale to C restricts matching to the ASCII character set alone rather than UTF-8; pass it locally to the command as

LC_ALL=C awk -v OFS=';' 'BEGIN{FPAT = "([^,]+)|([[:space:]]*\"[^\"]+\")"}{$1=$1}1' file

Since `FPAT` relies on regex matching, it could perform slower than the solutions based on non-regex substitutions.

Inian

For this not-so-complicated substitution, use `sed`:

$ cat file
3,       "hh,1,foo",            foo
"5,,,5", "1,2,3d,,,something ", foo2
test,    "col3",                foo3
$ sed -E 's/,([[:space:]]*")/;\1/g;s/("[[:space:]]*),/\1;/g' file
3;       "hh,1,foo";            foo
"5,,,5"; "1,2,3d,,,something "; foo2
test;    "col3";                foo3

Or even shorter:

$ sed -E 's/,(\s*"[^"]*"\s*),/;\1;/g' file
3;       "hh,1,foo";            foo
"5,,,5"; "1,2,3d,,,something "; foo2
test;    "col3";                foo3

And the `perl` solution would be:

$ perl -ane 's/,(\s*"[^"]*"\s*),/;$1;/g;print' 47281774
3;       "hh,1,foo";            foo
"5,,,5"; "1,2,3d,,,something "; foo2
test;    "col3";                foo3
sjsam
  • This is also a solution. I have another question (maybe a possible update of my initial question): I have tested both (sed vs awk), and I have the impression that the awk solution is faster than the regex one on big files. Do you observe that too? – Picaud Vincent Nov 14 '17 at 09:43
  • @PicaudVincent For larger files, `sed` should be faster. However for very large files, use `perl` which is the tool for the purpose. – sjsam Nov 14 '17 at 09:48
  • You win! your solution is the fastest! :) – Picaud Vincent Nov 14 '17 at 10:11
  • @PicaudVincent I have added the `perl` solution too. :) – sjsam Nov 14 '17 at 10:29

You can use the `awk` command to do a search/replace only inside the `"`-quoted parts.

The first step is to replace, inside the quoted parts only, the `,` by `_`:

cat demo.txt | awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(",","_",$i)} 1' 

which gives

3,       "hh_1_foo",            foo
"5___5", "1_2_3d___something ", foo2
test,    "col3",                foo3

Then replace the remaining `,` (the separators) by `;` with the usual `tr` command:

tr ',' ';' 

The last step uses `awk` again, in the "reverse" direction, to turn the temporary `_` placeholder back into the initial `,` character.

Putting everything together we have:

cat demo.txt | 
awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(",","_",$i)} 1' | 
tr ',' ';' | 
awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub("_",",",$i)} 1'

which gives

3;       "hh,1,foo";            foo
"5,,,5"; "1,2,3d,,,something "; foo2
test;    "col3";                foo3

as expected.
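One caveat worth adding (my note, not part of the original recipe): if the data itself contains literal `_` characters, the `_` placeholder corrupts them on the round trip. A sketch of a safer variant uses a control character (`\001`) that is very unlikely to appear in text data; `demo2.txt` is a hypothetical file name:

```shell
# Safer placeholder sketch: \001 instead of '_', so literal underscores in
# the data survive the quote-aware round trip.
printf 'a_b, "x,y", z\n' > demo2.txt
awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(",","\001",$i)} 1' demo2.txt |
tr ',' ';' |
awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub("\001",",",$i)} 1'
# -> a_b; "x,y"; z
```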


UPDATE: the fastest solution?

I used the 3 answers I got and benchmarked them on a 206 MB CSV file (with several runs to take care of cache effects...); here are the typical results I get:

1/ My initial answer:

time cat avec_vapeur.csv | awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub(",","_",$i)} 1' |  tr ',' ';' |  awk 'BEGIN{FS=OFS="\""} {for(i=2;i<NF;i+=2)gsub("_",",",$i)} 1'  > /dev/null

real    0m2.488s
user    0m5.025s
sys     0m0.242s

2/ The alternative awk-based solution, by RavinderSingh13:

time cat avec_vapeur.csv | awk -F"\"" '{for(i=1;i<=NF;i+=2){gsub(/,/,";",$i)}} 1' OFS="\"" > /dev/null

real    0m4.705s
user    0m4.631s
sys     0m0.111s

3/ The sed-based solution, by sjsam:

time cat avec_vapeur.csv | sed -E 's/,([[:space:]]*")/;\1/g;s/("[[:space:]]*),/\1;/g' > /dev/null 

real    0m0.174s
user    0m0.118s
sys     0m0.130s

-> The clear winner is the sed-based solution!

4/ The last answer I got, by Inian:

time cat avec_vapeur.csv |  awk -v OFS=';' 'BEGIN{FPAT = "([^,]+)|([[:space:]]*\"[^\"]+\")"}{$1=$1}1' > /dev/null

real    0m37.507s
user    0m37.463s
sys     0m0.122s

which is also the slowest I tested (no judgement here, I just did these tests for fun!)

Update: I initially misread Inian, sorry. If I understand you well, I add

LC_ALL=C

to speed things up.

Now I get:

real    0m20.268s
user    0m20.008s
sys     0m0.087s

which is faster but not as fast as the sed solution.

Now the game is over, no more benchmarks from me (I have to work a little bit too).

Last words for the winner, the `perl` solution by sjsam:

time cat avec_vapeur.csv | perl -ane  's/,(\s*"[^"]*"\s*),/;$1;/g;print' > /dev/null

real    0m0.134s
user    0m0.096s
sys     0m0.104s

which is even slightly faster than the sed one (at least with my tests)!

Picaud Vincent
  • You missed the `LC_ALL` setting here, which is key in awk for speeding things up. Also, kindly do not `cat file | awk..` in any of the attempts; update the benchmark results by running directly on the file, since avoiding the extra `cat` process gives more accurate results. E.g. for my attempt you could use `time LC_ALL=C awk -v OFS=';' 'BEGIN{FPAT = "([^,]+)|([[:space:]]*\"[^\"]+\")"}{$1=$1}1' file` – Inian Nov 14 '17 at 10:25
  • @Inian, sorry I misread you, hold on I will fix that and bench again. sorry – Picaud Vincent Nov 14 '17 at 10:26
  • Appreciate your benchmarking; it is not recommended to force the locale settings across the shell, as it could potentially _affect_ other commands. Just use it locally for the command – Inian Nov 14 '17 at 10:36

Assuming your Input_file is the same as the sample shown, the following awk may help you.

awk -F"\"" '{for(i=1;i<=NF;i+=2){gsub(/,/,";",$i)}} 1' OFS="\""   Input_file

Output will be as follows.

3;       "hh,1,foo";            foo
"5,,,5"; "1,2,3d,,,something "; foo2
test;    "col3";                foo3

EDIT: Adding an explanation with the non-one-liner form of the solution here too.

awk -F"\"" '{           ##Making " the field separator for all the lines in Input_file.
for(i=1;i<=NF;i+=2){    ##Starting a for loop from variable i value 1 till the value of NF (number of fields in a line), incrementing i by 2 so that we do NOT touch the commas that come inside " ".
  gsub(/,/,";",$i)      ##Using the gsub functionality of awk to globally substitute comma with semicolon in the current field's value.
}}                      ##Closing the block of the for loop here.
1                       ##awk works on the method of condition-then-action; mentioning 1 makes the condition TRUE, and since no action is defined the default print of the current line of Input_file happens.
' OFS="\"" Input_file   ##Setting OFS (output field separator) as " and mentioning the Input_file name here.
RavinderSingh13