
I have a CSV file in which 110 rows contain an extra semicolon character that breaks parsing of the file. The extra semicolon occurs in the 3rd field of these lines.

I have the following code to find the 110 rows, but how can I expand on it to remove the 3rd semicolon in these lines?

awk -F \; 'NF != 14' file.csv

Example input:

;1000;"First; Name";132;GB; ... ;Field14;
;1000;(Thank; You-);126;IR; ... ;Field14;

Wanted output:

;1000;"First Name";132;GB; ... ;Field14;
;1000;(Thank You-);126;IR; ... ;Field14;

Where the semicolon inside the fields containing "First; Name" and (Thank; You-) has been removed.

bullfighter
  • @bullfighter what's your version of `awk`? Do you have `gawk` installed? – vgersh99 Mar 29 '21 at 15:12
  • Quoted delimiters are one of the reasons you should use a CSV parser to parse CSV. The other I can think of is escaped quotes. – Aaron Mar 29 '21 at 15:29
  • See [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – Ed Morton Mar 29 '21 at 18:36
  • 1
    @Aaron the other common one is newlines within fields (e.g. if you export a CSV from Excel and some cells in the spreadsheet contained newlines). – Ed Morton Mar 29 '21 at 19:35

5 Answers


And the GNU awk FPAT version:

$ gawk '
BEGIN {
    # a field is either a run of non-semicolons or a double-quoted string
    FPAT = "([^;]*)|(\"[^\"]+\")"
    OFS = ";"
}
{
    # strip the semicolon(s) inside the 3rd field, then print the rebuilt record
    gsub(/;/,"",$3)
}1' file

Output:

;1000;"First Name";132;GB; ... ;Field14;

More generically for each field:

$ gawk '
BEGIN {
    # same field definition: plain fields or double-quoted strings
    FPAT = "([^;]*)|(\"[^\"]+\")"
    OFS = ";"
}
{
    # strip semicolons inside every field, not just the 3rd
    for(i=1;i<=NF;i++)
        gsub(/;/,"",$i)
}1' file
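
A quick usage sketch (the second quoted field below is hypothetical, added only to show that every quoted field gets cleaned, not just the 3rd):

$ cat file
;1000;"First; Name";132;"G;B"; ... ;Field14;
$ gawk 'BEGIN{FPAT="([^;]*)|(\"[^\"]+\")"; OFS=";"} {for(i=1;i<=NF;i++) gsub(/;/,"",$i)}1' file
;1000;"First Name";132;"GB"; ... ;Field14;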
James Brown

With your shown samples, could you please try the following. Written and tested in GNU awk.

awk '
match($0,/"[^;]*[^"]*/){
  # grab the matched text: the opening quote of the field up to, but not
  # including, its closing quote
  val=substr($0,RSTART,RLENGTH)
  # remove the stray semicolon(s) inside it
  gsub(/;/,"",val)
  # reassemble the line: part before the match, cleaned value, rest of the line
  print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file
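
For the first (quoted) sample line this should print:

;1000;"First Name";132;GB; ... ;Field14;

Note that only lines containing a double quote are matched, so lines without one are not printed by this snippet.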
RavinderSingh13

Another approach is to treat the double quotes as the delimiters and remove the semicolons between them:

$ awk 'BEGIN{FS=OFS="\""} {for(i=2;i<=NF;i+=2) gsub(/;/,"",$i)}1' file

;1000;"First Name";132;GB; ... ;Field14;

This may not work if you have escaped quotes inside the quoted fields.

karakfa

Using sed:

sed -ri 's/(^.*\;.*\;\".*)(\;)(.*\"\;.*$)/\1\3/' file

Enable extended regular expressions with -E or -r, then split each line into three sections: the first covers the first two ";"-delimited fields plus the start of the third field up to the stray ";", the second is that stray ";" itself, and the third is the rest of the line. The line is then replaced with the first and third sections only, which drops the stray semicolon. A commented version of the same expression is sketched below.
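
Roughly the same expression, without the in-place -i flag and without the unneeded backslashes before ";" and the quote (a sketch, assuming a sed that supports -E):

# group 1: everything up to and including the opening quote of the 3rd field,
#          plus the text before the stray semicolon
# group 2: the stray semicolon itself (omitted from the replacement)
# group 3: the rest of the quoted field, its closing quote, and the remainder of the line
sed -E 's/(^.*;.*;".*)(;)(.*";.*$)/\1\3/' file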

Raman Sailopal
  • 1
    I suggest using `-E` by default rather than `-r`. If OP was using BSD `sed` `-E` would work while `-r` wouldn't, plus it's consistent with the equivalent `grep` flag and the name of the regex flavour (ERE). `-r` would be necessary for old GNU `sed` versions though. – Aaron Mar 29 '21 at 15:32
  • @Aaron is right that `-E` should be used but that usage of `-i` without a backup file wouldn't work on BSD sed so it'd need to be `sed -Ei '' 'script'` or similar for portability to both sed variants. – Ed Morton Mar 29 '21 at 19:32

I strongly recommend using CSV-specific tools on that CSV data. For example, with csvkit:

csvformat -d ';' -D , file | tr -d ';' | csvformat -d , -D ';'
;1000;First Name;132;GB; ... ;Field14;

That converts the delimiters from semicolons to commas, deletes all remaining semicolons, then converts the comma delimiters back to semicolons.

The double quotes got removed in the process, but that's OK because the field no longer contains the delimiter character.


Or if you have Ruby installed:

ruby -rcsv -e 'CSV.filter(col_sep: ";") {|row| row[2].delete!(";")}' file
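
With the quoted sample line this should print something like:

;1000;First Name;132;GB; ... ;Field14;

(as with the csvkit pipeline, the quotes around the cleaned field are dropped, since it no longer contains the delimiter).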
glenn jackman