
I need to delete 2 columns in a comma-separated values file. Consider the following lines in the csv file:

"abc@xyz.com,www.example.com",field2,field3,field4
"def@xyz.com",field2,field3,field4

Now, the result I want at the end:

"abc@xyz.com,www.example.com",field4
"def@xyz.com",field4

I used the following command:

awk 'BEGIN{FS=OFS=","}{print $1,$4}'

But the embedded comma inside the quotes is creating a problem. This is the result I am getting:

"abc@xyz.com,field3
"def@xyz.com",field4

Now my question is: how do I make awk ignore the commas that are inside the double quotes?
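
For reference, with a plain FS="," awk splits on every comma, including the one inside the quotes, so the first line is treated as five fields and `$4` lands on `field3`. A minimal check (assuming the two sample lines are saved in a file named `file`):

$ awk -F, '{print NF": "$2}' file
5: www.example.com"
4: field2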

Deepak K M

4 Answers


From the GNU awk manual (http://www.gnu.org/software/gawk/manual/gawk.html#Splitting-By-Content):

$ awk -vFPAT='([^,]*)|("[^"]+")' -vOFS=, '{print $1,$4}' file
"abc@xyz.com,www.example.com",field4
"def@xyz.com",field4

See also What's the most robust way to efficiently parse CSV using awk? for more general approaches to parsing CSVs that include newlines, etc. within fields.
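
For a quick look at what FPAT actually matches here, a minimal check (assuming gawk 4.0 or later, which introduced FPAT; the sample record is the address line also used in the comments below):

$ echo 'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown' |
  gawk -v FPAT='([^,]*)|("[^"]+")' '{ for (i = 1; i <= NF; i++) print i " -> " $i }'
1 -> Robbins
2 -> Arnold
3 -> "1234 A Pretty Street, NE"
4 -> MyTown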

Ed Morton

  • I'm curious what happens internally? Here's a usage of the regex in perl that produces very different matches: `perl -lnE 'while(/([^,]*)|("[^"]+")/g){say "#$1#"}' <<<'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA'` – Johannes Riecken Feb 13 '19 at 12:26
  • @rubystallion You'd have to ask the gawk developers about what happens internally, but regexps usually match the leftmost-longest string, so the fact that perl is matching on `"1234 A Pretty Street` and `NE"` as 2 **separate strings** seems to be wrong, given that both the awk above and `grep -Eo '([^,]*)|("[^"]+")' <<<'Robbins,Arnold,"1234 A Pretty Street, NE",MyTown,MyState,12345-6789,USA'` identify `"1234 A Pretty Street, NE"` as a single string. Of course that perl script might contain some magical incantation that means "don't match leftmost-longest", idk as I don't get perl syntax. – Ed Morton Feb 13 '19 at 15:42
  • Thanks for the grep example, which pointed me to where to find the answer: the [POSIX spec](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_01) says: `If the pattern permits a variable number of matching characters and thus there is more than one such sequence starting at that point, the longest such sequence is matched. For example, the BRE "bb*" matches the second to fourth characters of the string "abbbc", and the ERE "(wee|week)(knights|night)" matches all ten characters of the string "weeknights".` – Johannes Riecken Feb 14 '19 at 10:38
  • Whereas the [perl spec](https://perldoc.perl.org/perlre.html#The-Basics) says `Alternatives are tried from left to right, so the first alternative found for which the entire expression matches, is the one that is chosen. This means that alternatives are not necessarily greedy. For example: when matching "foo|foot" against "barefoot", only the "foo" part will match, as that is the first alternative tried, and it successfully matches the target string.` – Johannes Riecken Feb 14 '19 at 10:38
  • Do you know if this works in mawk? – Data Origin Aug 14 '19 at 09:44
  • @DataOrigin Probably not, but mawk has been adopting some gawk extensions, so check the man page for whatever mawk version you're using or simply try it and see. – Ed Morton Aug 14 '19 at 10:05
  • For whatever reason this doesn't work properly on the awk (20070501) that comes pre-installed with Mac OS X. `662,ABILENE-SWEETWATER,KNOX,76363,TX` comes through as a single field; not clear why. It seems like FPAT might not be supported. – Ralph Callaway Sep 04 '19 at 03:18
  • @RalphCallaway Right, FPAT is a GNU awk extension. Install gawk or see the link in my answer for a solution that'll work in any awk. – Ed Morton Sep 04 '19 at 04:03
  • To clarify Ed Morton's point, this command doesn't work in all versions of unix `awk` but does work in GNU's `gawk`. – Ben Wheeler Jun 28 '22 at 02:35
  • If the strings are encapsulated in apostrophes instead of quotes, this works: `-vFPAT="([^,]*)|('[^']+')"` – Matiaan Sep 23 '22 at 07:00

This is not a bash/awk solution, but I recommend CSVKit, which can be installed by pip install csvkit. It provides a collection of command line tools to work specifically with CSV, including csvcut, which does exactly what you ask for:

csvcut --columns=1,4 <<EOF
"abc@xyz.com,www.example.com",field2,field3,field4
"def@xyz.com",field2,field3,field4
EOF

Output:

"abc@xyz.com,www.example.com",field4
def@xyz.com,field4

It strips the unnecessary quotes, which I suppose shouldn't be a problem.

Read the docs of CSVKit here on RTD. ThoughtBot has a nice little blog post introducing this tool, which is where I learnt about CSVKit.
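
If you do want the quotes back in the output, one option is to re-quote with csvformat, another csvkit tool (a sketch, assuming a csvkit version where csvformat supports -U/--out-quoting; note that -U 1 quotes every field, not just the first, and input.csv is a stand-in file name):

csvcut --columns=1,4 input.csv | csvformat -U 1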

4ae1e1

In your sample input file, it is the first field, and only the first field, that is quoted. If this is true in general, then consider the following as a method for deleting the second and third columns:

$ awk -F, '{for (i=1;i<=NF;i++){printf "%s%s",(i>1)?",":"",$i; if ($i ~ /"$/)i=i+2};print""}' file
"abc@xyz.com,www.example.com",field4
"def@xyz.com",field4

As mentioned in the comments, awk does not natively understand quoted separators. This solution works around that by looking for the first field that ends with a quote. It then skips the two fields that follow.

The Details

  • for (i=1;i<=NF;i++)

    This starts a for loop over each field i.

  • printf "%s%s",(i>1)?",":"",$i

    This prints field i. If it is not the first field, the field is preceded by a comma.

  • if ($i ~ /"$/)i=i+2

    If the current field ends with a double quote, the field counter is incremented by 2. This is how we skip over fields 2 and 3.

  • print""

    After we are done with the for loop, this prints a newline.
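
The same idea can be parameterised. Here is a minimal sketch with the number of columns to drop after the quoted field pulled out into a variable (`skip` is a name introduced here for illustration, not part of the original one-liner; with skip=2 it behaves the same as above):

awk -F, -v skip=2 '{
  for (i = 1; i <= NF; i++) {
    printf "%s%s", (i > 1 ? "," : ""), $i   # print field i, preceded by a comma unless it is the first
    if ($i ~ /"$/) i += skip                # field ends the quoted column: skip the next `skip` fields
  }
  print ""
}' file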

John1024

This awk should work regardless of where the quoted field is and works on escaped quotes as well.

awk '{while(match($0,/"[^"]+",|([^,]+(,|$))/,a)){
      $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
      print b[1] b[4];x=0}' file

Input

"abc@xyz.com,www.example.com",field2,field3,field4  
"def@xyz.com",field2,field3,field4  
field1,"abc@xyz.com,www.example.com",field3,field4  

Output

"abc@xyz.com,www.example.com",field4
"def@xyz.com",field4
field1,field4

It even works on

field1,"field,2","but this field has ""escaped"\" quotes",field4

which the mighty FPAT variable fails on!


Explanation

 while(match($0,/"[^"]+",|([^,]+(,|$))/,a))

Starts a while loop that continues as long as the match is a success (i.e. there is a field).
The match matches the first occurrence of the regex, which incidentally matches the fields, and stores it in array a.

 $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]

Sets $0 to begin at the end of the matched field and adds the matched field to the corresponding array position in b.

  print b[1] b[4];x=0}

Prints the fields you want from b and sets x back to zero for the next line.


Flaws

Will fail if a field contains both escaped quotes and a comma.


Edit

Updated to support empty fields

awk '{while(match($0,/("[^"]+",|[^,]*,|([^,]+$))/,a)){
     $0=substr($0,RSTART+RLENGTH);b[++x]=a[0]}
     print b[1] b[4];x=0}' file
  • It will also fail if a field is empty, e.g. `foo,,bar`. – Ed Morton Apr 15 '15 at 13:05
  • @EdMorton Fixed, I think? –  Apr 15 '15 at 13:22
  • Looks better. Now you need to add `delete b` when you set `x=0` or `b` will retain the contents of the end of the previous record when the current record has fewer fields, e.g. `print b[3]` for input lines `a,b,c` followed by just `d,e` would output `c` twice. – Ed Morton Apr 15 '15 at 13:32
  • @EdMorton Yeah, I thought of that, but given OP's question I presumed that there will always be a 4th field. –  Apr 15 '15 at 14:25
  • This causes an "illegal statement" syntax error for me with awk in bash – Ben Wheeler Jun 28 '22 at 02:30
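
Taking the `delete b` suggestion from the comments above, a sketch of the updated script might look like this (still gawk-only, because of the three-argument match(); the "illegal statement" error in the last comment is what the macOS/BSD awk reports for that extension):

awk '{
  while (match($0, /("[^"]+",|[^,]*,|([^,]+$))/, a)) {
    $0 = substr($0, RSTART + RLENGTH)   # consume the matched field from the record
    b[++x] = a[0]                       # store the field (with its trailing comma, if any)
  }
  print b[1] b[4]
  x = 0
  delete b                              # reset so short records do not reuse fields from the previous line
}' file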