31

Data file - data.txt:

ABC "I am ABC" 35 DESC
DEF "I am not ABC" 42 DESC

cat data.txt | awk '{print $2}'

will print "I" instead of the whole quoted string.

How can I make awk ignore the spaces inside the quotes and treat the quoted string as a single token?

Roy Chan

7 Answers

24

Another alternative is to use the FPAT variable, which defines a regular expression describing the contents of each field (rather than the separators between fields).

Save this AWK script as parse.awk:

#!/bin/awk -f

BEGIN {
  # A field is either a run of non-space characters or a double-quoted string
  FPAT = "([^ ]+)|(\"[^\"]+\")"
}
{
  print $2
}

Make it executable with chmod +x ./parse.awk and parse your data file as ./parse.awk data.txt:

"I am ABC"
"I am not ABC"
mabalenk

  • Thank you for the regex! ;-) Saved me at least 20 minutes of frustrated attempts. +1 – jweyrich May 10 '17 at 17:42
  • This should be the accepted answer. It works like a charm, thanks. – Nico Feb 21 '18 at 23:02
  • This is the best answer. I can use the following command to convert my logs so I don't have to set FPAT every time: echo 'field1 "the second field"' | awk 'BEGIN {FPAT = "([^ ]+)|(\"[^\"]+\")"}{for(i=1;i<=NF;i++){gsub(" ","%20",$i)} print}' – tinyhare Aug 13 '18 at 04:49
  • This works with GNU awk, but not the awk that ships with Mac OS X, so do `brew install gawk` if you're on a Mac. – Vinod Kurup Mar 24 '22 at 17:45
9

Yes, this can be done nicely in awk. It's easy to get all the fields without any serious hacks.

(This example works in both The One True Awk and in gawk.)

{
  # Split the whole record on double quotes; a[2] holds the quoted text
  split($0, a, "\"")
  $2 = a[2]        # the second field becomes the full quoted string
  $3 = $(NF - 1)   # NF still reflects the original split, so this is the numeric column
  $4 = $NF         # and this is the trailing DESC column
  print "and the fields are ", $1, "+", $2, "+", $3, "+", $4
}
DigitalRoss
  • To format for a one liner: `cat data.txt | awk 'split($0, a, "\"") {$2 = a[2]} {$3 = $(NF - 1)} {$4 = $NF} {print "and the fields are ", $1, "+", $2, "+", $3, "+", $4}'` – Chris Gregg Jul 08 '11 at 04:21
  • This only works if you have a single quoted field, in the second position, and four fields in total. It's not generic. A solution that accepts any quoted field in any position would be ideal. – Joaquin Cuenca Abela Jan 19 '14 at 12:55
8

Try this:

$ cat data.txt | awk -F\" '{print $2}'
I am ABC
I am not ABC
Chris Gregg
  • I should note that this isn't particularly generic: it simply changes the field separator to `"` and selects the second field. – Chris Gregg Jul 08 '11 at 03:27
  • But if I want to use the information before and after... it won't work =( – Roy Chan Jul 08 '11 at 03:35
  • @Roy Chan -- true. Awk is not really the right tool for parsing quoted strings. Go down to the third post [at this horribly formatted Google Cache link](http://webcache.googleusercontent.com/search?q=cache:HA9Ix2yPEasJ:forums11.itrc.hp.com/service/forums/questionanswer.do%3FthreadId%3D1028610+awk+quotes+field&cd=1&hl=en&ct=clnk&gl=us&client=safari&source=www.google.com) and you can see an example that is much longer but might help. – Chris Gregg Jul 08 '11 at 03:39
  • @DigitalRoss -- nice solution; I hadn't thought of that method. – Chris Gregg Jul 08 '11 at 04:08
  • @RoyChan This solution is usable by doubling the index (but only if every field has quotes). You can use a higher number, and it will work. Just add more to the index than you think, to account for the blank when there are two quotes in a row, since the quote is used as the field delimiter which generates more fields. For example, to use something after, do: `echo "'hello there' 'the world'" | awk -F\' '{print $4}'` (The result is "the world" without quotes). But if some have no quote: `echo "'hello there' the world" | awk -F\' '{print $3}'` yields " the world" without quotes but with a space. – Poikilos Apr 07 '20 at 15:12
4

The top answer for this question only works for lines with a single quoted field. When I found this question I needed something that could work for an arbitrary number of quoted fields.

Eventually I came across an answer by Wintermute in another thread that provides a good, generalized solution to this problem. I've just modified it to strip the quotes. Note that you need to invoke awk with -F\" when running the program below.

BEGIN { OFS = "" } {
    for (i = 1; i <= NF; i += 2) {
        gsub(/[ \t]+/, ",", $i)
    }
    print
}

This works because, when you split on the " character, every other field is the contents of a quoted string; the loop therefore touches only the unquoted fields, turning each run of whitespace into a comma.

You can then easily chain another instance of awk to do whatever processing you need (just use the field separator switch again, -F,).
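
For example, if the script above is saved as quoted.awk (a file name chosen here just for illustration), the chained invocation could look like this:

$ awk -F\" -f quoted.awk data.txt | awk -F, '{print $2}'
I am ABC
I am not ABC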

Note that this might break if the first field is quoted - I haven't tested it. If it does, though, it should be easy to fix by adding an if statement that starts the loop at 2 rather than 1 when the first character of the line is a `"`.

khh
2

I've put together a function that re-splits $0 into an array called B. Spaces between double quotes do not act as field separators. It works with any number of fields, and with a mix of quoted and unquoted ones. Here goes:

#!/usr/bin/gawk -f

# Resplit $0 into array B. Spaces between double quotes are not separators.
# Single quotes not handled. No escaping of double quotes.
function resplit(       a, l, i, j, b, k, BNF) # all are local variables
{
  l=split($0, a, "\"")
  BNF=0
  delete B
  for (i=1;i<=l;++i)
  {
    if (i % 2)
    {
      k=split(a[i], b)
      for (j=1;j<=k;++j)
        B[++BNF] = b[j]
    }
    else
    {
      B[++BNF] = "\""a[i]"\""
    }
  }
}

{
  resplit()

  for (i=1;i<=length(B);++i)
    print i ": " B[i]
}
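
As a quick check, saving this as resplit.awk (an illustrative name; the shebang assumes gawk lives at /usr/bin/gawk) and making it executable should print each token of data.txt on its own line, with the quoted strings kept intact:

$ chmod +x resplit.awk && ./resplit.awk data.txt
1: ABC
2: "I am ABC"
3: 35
4: DESC
1: DEF
2: "I am not ABC"
3: 42
4: DESC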

Hope it helps.

arg0
0

Here is something like what I finally got working, which is more generic for my project. Note that it doesn't use awk.

someText="ABC \"I am ABC\" 35 DESC '1 23' testing 456"
putItemsInLines() {
    local items=""
    local firstItem="true"
    while test $# -gt 0; do
        if [ "$firstItem" == "true" ]; then
            items="$1"
            firstItem="false"
        else
            items="$items
$1"
        fi
        shift
    done
    echo "$items"
}

count=0
while read -r valueLine; do
    echo "$count: $valueLine"
    count=$(( $count + 1 ))
done <<< "$(eval putItemsInLines $someText)"   # eval re-applies word splitting, so each quoted chunk stays a single argument

Which outputs:

0: ABC
1: I am ABC
2: 35
3: DESC
4: 1 23
5: testing
6: 456
bourne2program
0

Okay, if you really want all three fields, you can get them, but it takes a lot of piping:

$ cat data.txt | awk -F\" '{print $1 "," $2 "," $3}' | awk -F' ,' '{print $1 "," $2}' | awk -F', ' '{print $1 "," $2}' | awk -F, '{print $1 "," $2 "," $3}'
ABC,I am ABC,35
DEF,I am not ABC,42

By the last pipe you've got all three fields to do whatever you'd like with.

Chris Gregg