
I have a CSV file in which 110 rows contain an extra semicolon character that breaks parsing of the file. The extra semicolon occurs in the 3rd field of these lines.

I have the following code to find the 110 rows, but how can I expand on it to remove the 3rd semicolon in these lines?

awk -F \; 'NF != 14' file.csv

Example input:

;1000;"First; Name";132;GB; ... ;Field14;
;1000;(Thank; You-);126;IR; ... ;Field14;

Wanted output:

;1000;"First Name";132;GB; ... ;Field14;
;1000;(Thank You-);126;IR; ... ;Field14;

Where the semicolon inside the fields containing "First; Name" and (Thank; You-) has been removed.

bullfighter
  • @bullfighter what's your version of `awk`? Do you have `gawk` installed? – vgersh99 Mar 29 '21 at 15:12
  • Quoted delimiters are one of the reasons you should use a CSV parser to parse CSV. The other I can think of is escaped quotes. – Aaron Mar 29 '21 at 15:29
  • See [whats-the-most-robust-way-to-efficiently-parse-csv-using-awk](https://stackoverflow.com/questions/45420535/whats-the-most-robust-way-to-efficiently-parse-csv-using-awk) – Ed Morton Mar 29 '21 at 18:36
  • 1
    @Aaron the other common one is newlines within fields (e.g. if you export a CSV from Excel and some cells in the spreadsheet contained newlines). – Ed Morton Mar 29 '21 at 19:35

5 Answers


And the GNU awk FPAT version:

$ gawk '
BEGIN {
    # a field is either a run of non-semicolons or a double-quoted string
    FPAT = "([^;]*)|(\"[^\"]+\")"
    OFS = ";"
}
{
    # strip the semicolon(s) inside the 3rd field, then print the rebuilt record
    gsub(/;/,"",$3)
}1' file

Output:

;1000;"First Name";132;GB; ... ;Field14;

More generically for each field:

$ gawk '
BEGIN {
    # same field definition: plain fields or double-quoted strings
    FPAT = "([^;]*)|(\"[^\"]+\")"
    OFS = ";"
}
{
    # strip semicolons inside every field, not just the 3rd
    for(i=1;i<=NF;i++)
        gsub(/;/,"",$i)
}1' file
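
A quick usage sketch (the second quoted field below is hypothetical, added only to show that every quoted field gets cleaned, not just the 3rd):

$ cat file
;1000;"First; Name";132;"G;B"; ... ;Field14;
$ gawk 'BEGIN{FPAT="([^;]*)|(\"[^\"]+\")"; OFS=";"} {for(i=1;i<=NF;i++) gsub(/;/,"",$i)}1' file
;1000;"First Name";132;"GB"; ... ;Field14;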
James Brown

With your shown samples, could you please try the following. Written and tested in GNU awk.

awk '
match($0,/"[^;]*[^"]*/){
  # grab the matched text: the opening quote of the field up to, but not
  # including, its closing quote
  val=substr($0,RSTART,RLENGTH)
  # remove the stray semicolon(s) inside it
  gsub(/;/,"",val)
  # reassemble the line: part before the match, cleaned value, rest of the line
  print substr($0,1,RSTART-1) val substr($0,RSTART+RLENGTH)
}
' Input_file
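
For the first (quoted) sample line this should print:

;1000;"First Name";132;GB; ... ;Field14;

Note that only lines containing a double quote are matched, so lines without one are not printed by this snippet.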
RavinderSingh13

Another approach is to treat the double quotes as the delimiters and remove the semicolons between them:

$ awk 'BEGIN{FS=OFS="\""} {for(i=2;i<=NF;i+=2) gsub(/;/,"",$i)}1' file

;1000;"First Name";132;GB; ... ;Field14;

This may not work if you have escaped quotes inside the quoted fields.

karakfa

Using sed:

sed -ri 's/(^.*\;.*\;\".*)(\;)(.*\"\;.*$)/\1\3/' file

Enable extended regular expressions with -E or -r, then split each line into three sections: the first covers the first two ";"-delimited fields plus the start of the third field up to the stray ";", the second is that stray ";" itself, and the third is the rest of the line. The line is then replaced with the first and third sections only, which drops the stray semicolon. A commented version of the same expression is sketched below.
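
Roughly the same expression, without the in-place -i flag and without the unneeded backslashes before ";" and the quote (a sketch, assuming a sed that supports -E):

# group 1: everything up to and including the opening quote of the 3rd field,
#          plus the text before the stray semicolon
# group 2: the stray semicolon itself (omitted from the replacement)
# group 3: the rest of the quoted field, its closing quote, and the remainder of the line
sed -E 's/(^.*;.*;".*)(;)(.*";.*$)/\1\3/' file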

Raman Sailopal
  • 1
    I suggest using `-E` by default rather than `-r`. If OP was using BSD `sed` `-E` would work while `-r` wouldn't, plus it's consistent with the equivalent `grep` flag and the name of the regex flavour (ERE). `-r` would be necessary for old GNU `sed` versions though. – Aaron Mar 29 '21 at 15:32
  • @Aaron is right that `-E` should be used but that usage of `-i` without a backup file wouldn't work on BSD sed so it'd need to be `sed -Ei '' 'script'` or similar for portability to both sed variants. – Ed Morton Mar 29 '21 at 19:32

I strongly recommend using CSV-specific tools on that CSV data. For example, with csvkit:

csvformat -d ';' -D , file | tr -d ';' | csvformat -d , -D ';'
;1000;First Name;132;GB; ... ;Field14;

That converts the delimiters from semicolons to commas, deletes all remaining semicolons, then converts the comma delimiters back to semicolons.

The double quotes got removed in the process, but that's OK because the field no longer contains the delimiter character.


Or if you have Ruby installed:

ruby -rcsv -e 'CSV.filter(col_sep: ";") {|row| row[2].delete!(";")}' file
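
With the quoted sample line this should print something like:

;1000;First Name;132;GB; ... ;Field14;

(as with the csvkit pipeline, the quotes around the cleaned field are dropped, since it no longer contains the delimiter).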
glenn jackman