
In the following file I want to replace every ; with , with one exception: a ; inside a string (delimited by two ") must not be replaced.

Example input:

A;B;C;D
5cc0714b9b69581f14f6427f;5cc0714b9b69581f14f6428e;1;"5cc0714b9b69581f14f6427f;16a4fba8d13";xpto;
5cc0723b9b69581f14f64285;5cc0723b9b69581f14f64294;2;"5cc0723b9b69581f14f64285;16a4fbe3855";xpto;
5cc072579b69581f14f6428a;5cc072579b69581f14f64299;3;"5cc072579b69581f14f6428a;16a4fbea632";xpto;

Expected output:

A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,

For sed I have: sed 's/;/,/g' input.txt > output.txt but this would replace everything.

The regex for the " delimited string: \".*;.*\" .

(A regex for hexadecimal would be better -- something like: [0-9a-fA-F]+)

My problem is combining it all to make a grep -o / sed that replaces everything except for that pattern.

The file size is on the order of tens of GB (at most 99 GB), so performance matters.

Any ideas are appreciated.

JonyD

3 Answers


sed is for doing simple s/old/new on individual strings. grep is for doing g/re/p. You're not trying to do either of those tasks so you shouldn't be considering either of those tools. That leaves the other standard UNIX tool for manipulating text - awk.

You have a ;-separated CSV that you want to make ,-separated. That's simply:

$ awk -v FPAT='[^;]*|"[^"]+"' -v OFS=',' '{$1=$1}1' file
A,B,C,D
5cc0714b9b69581f14f6427f,5cc0714b9b69581f14f6428e,1,"5cc0714b9b69581f14f6427f;16a4fba8d13",xpto,
5cc0723b9b69581f14f64285,5cc0723b9b69581f14f64294,2,"5cc0723b9b69581f14f64285;16a4fbe3855",xpto,
5cc072579b69581f14f6428a,5cc072579b69581f14f64299,3,"5cc072579b69581f14f6428a;16a4fbea632",xpto,

The above uses GNU awk for FPAT. See What's the most robust way to efficiently parse CSV using awk? for more details on parsing CSVs with awk.
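If GNU awk is not available, one portable alternative (a sketch, not part of the answer above) is to split each line on the double quote instead of using FPAT. The odd-numbered pieces then lie outside quotes, so only those need their ; changed. This assumes quotes are balanced on every line and that no quoted field contains a newline; the function name semi_to_comma is made up for illustration.

```shell
# Hypothetical helper: reads ;-separated lines on stdin, writes ,-separated lines on stdout.
# Assumes balanced double quotes per line and no newlines inside quoted fields.
semi_to_comma() {
  awk 'BEGIN { FS = OFS = "\"" }          # split and rejoin each line on the double quote
       {
         # fields 1, 3, 5, ... are the text outside quotes
         for (i = 1; i <= NF; i += 2)
           gsub(/;/, ",", $i)
       }
       1'                                  # print the (possibly modified) line
}
```

Usage would be something like semi_to_comma < input.txt > output.txt. Because it makes a single pass and uses only POSIX awk features, it should also behave the same under mawk or busybox awk.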

Ed Morton

If I understand your requirements correctly, one option would be a three-pass approach.

Given your comment about hex, I'll assume that # never appears in the input, so you can do (using GNU sed):

sed -E 's/("[^"]+);([^"]+")/\1#\2/g' original > transformed
sed -i 's/;/,/g' transformed
sed -i 's/#/;/g' transformed

The idea is to replace the ; inside quotes with something else and write the result to a new file, then replace every remaining ; with , and finally put the ; back inside the quotes, editing the same file in place (the -i flag of sed).

The three passes can be combined into a single command:

sed -E 's/("[^"]+);([^"]+")/\1#\2/g;s/;/,/g;s/#/;/g' original > transformed

That said, there are probably plenty of CSV parsers which already handle quoted fields and which you could use in the final use case, as I bet this is just an intermediate step for something else later in the chain.

From Ed Morton's comment: if you do it in one pass, you can use \n as replacement separator as there can't be a newline in the text considered line by line.
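A sketch of that one-pass variant, assuming GNU sed (needed for \n in both the pattern and the replacement); the wrapper name convert_one_pass is hypothetical. Like the # version, the first substitution protects only one ; per quoted string, which matches the sample data but not quoted strings containing several ;.

```shell
# Hypothetical wrapper; GNU sed assumed. Reads stdin, writes stdout.
convert_one_pass() {
  # 1) hide the ; inside a quoted string behind a newline, which cannot
  #    occur inside a line that sed is processing
  # 2) turn every remaining ; into ,
  # 3) restore the hidden ; from the newline
  sed -E 's/("[^"]+);([^"]+")/\1\n\2/g; s/;/,/g; s/\n/;/g'
}
```

Usage would be something like convert_one_pass < original > transformed.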

Tensibai
  • very well thought. Simple and effective. Thanks! – JonyD Jul 19 '19 at 13:09
  • I don't see why you can't do that in one sed pass. – stevesliva Jul 19 '19 at 13:09
  • @Tensibai It's not working. The # is not being replaced in your first sed. even adding /g at the end, nothing is replaced – JonyD Jul 19 '19 at 13:15
  • @stevesliva You mean the 3 commands in one sed call? Yes, it's doable, but I wanted to keep it simple; the overhead is not that much IIRC, but I may be wrong – Tensibai Jul 19 '19 at 13:24
  • @Tensibai I found it. Needs -E and it's \1 instead of $1. ```sed -E 's/("[^"]+);([^"]+")/\1#\2/g' PhaseChanges_orig.csv > transformed``` – JonyD Jul 19 '19 at 13:27
  • Aww, I was wondering why I couldn't repro, reading your comment I found it, I've an alias :( #shame – Tensibai Jul 19 '19 at 13:30
  • I mention that it doesn't require several passes mostly to make it clear that sed doesn't require it here. There are reasons to *have to* pipe sed to sed... but sequential manipulations within a single line isn't one of them. That said, if you have intermediate files, it can make debug easier, and that's a valid reason to separate them. – stevesliva Jul 19 '19 at 13:31
  • @stevesliva I've added the alternative to be complete – Tensibai Jul 19 '19 at 13:35
  • @JonyD We have edited at the same time and yours had been automatically rejected, sorry about it :) I hope I didn't miss a point – Tensibai Jul 19 '19 at 13:36
  • If you did it all in a single call to sed then you could use `\n` instead of `#` as the temp char since you KNOW there can't be newlines in the data (unlike `#` which you're just HOPING won't be in the data). You should mention your solution requires GNU sed for various constructs it uses. – Ed Morton Jul 19 '19 at 15:48

This might work for you (GNU sed):

sed -E ':a;s/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/;ta;y/;/,/;y/\n/;/' file

Replace ;'s inside double quotes with newlines, transpose ;'s to ,'s and then transpose newlines to ;'s.
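The same command can be written with one expression per step, which may make the loop easier to follow. This is only a sketch: GNU sed is assumed, and the wrapper name protect_and_convert is hypothetical. Because the substitution is retried until it no longer matches, any number of ; inside one quoted string is protected.

```shell
# Hypothetical wrapper; GNU sed assumed. Reads stdin, writes stdout.
# :a       label marking the start of the loop
# s/.../   from the start of the line, skip unquoted text and any complete
#          quoted strings, enter an open quote, and replace the first ;
#          found inside it with a newline
# ta       if that substitution succeeded, jump back to :a and try again
# y/;/,/   translate every remaining (unquoted) ; to ,
# y/\n/;/  translate the protecting newlines back to ;
protect_and_convert() {
  sed -E \
    -e ':a' \
    -e 's/^([^"]*("[^"]*"[^"]*)*"[^";]*);/\1\n/' \
    -e 'ta' \
    -e 'y/;/,/' \
    -e 'y/\n/;/'
}
```

Usage would be something like protect_and_convert < file > output.txt.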

potong