awk not finding a string value with special characters in a csv file

Question

I'm just trying to find a way to check if a string value exists in a csv column using awk command. It works fine except for the strings which contains special characters like ( or [.

It worked for text with no special chars. Then when I tried searching text with special chars it hasn't worked, so I tried to escape these chars but it hasn't worked neither.

So I got a test.csv file which contains a row like:

"hello","(hello)","this is a test (bye)","Alright"

then if I try searching the first field like:

text="hello"; awk -F '","' -v text="$text" '$1~text {print $4}' test.csv

it returns Alright", which is fine.

then if I try searching the second field like:

text="(hello)"; awk -F '","' -v text="$text" '$2~text {print $4}' test.csv

it returns Alright", which is fine too.

then if I try searching the third field like:

text="this is a test (bye)"; awk -F '","' -v text="$text" '$3~text {print $4}' test.csv

it returns nothing.

then if I try escaping special chars like:

text="this is a test \(bye\)"; awk -F '","' -v text="$text" '$3~text {print $4}' test.csv

it returns a message like:

awk: warning: the escape sequence '\(' is treated as a simple "("
awk: warning: the escape sequence '\)' is treated as a simple ")"

with no results as previous.

If you want to search for a literal string then prefer `index($3,text)` instead of `$3 ~ text`; and the problem with your regex-escaping is because the backslashes are processed when used in an assignment (`-v`) — Fravadona, May 23 '23 at 13:16
For a CSV with quoted values, I would reach for a special-purpose CSV-parsing library rather than `awk`. — Mark Reed, May 23 '23 at 15:03

markp-fuso · Accepted Answer · 2023-05-23T15:58:23.397

Focusing solely on the non-matching regex issue ....

The ~ operator says to process the right side of the operation as a regex. When the right side is a string (or variable containing a string - as in this case), the string is converted to a regex (see GNU awk - Using Dynamic Regexps).

In this case:

text="this is a test (bye)"
awk -F '","' -v text="$text" '$3~text {print $4}' test.csv

The comparison ($3~text) is converted to:

$3~/this is a test (bye)/

Here the parens are treated as special regex characters and not as literal parens so this is effectively the same as:

$3~/this is a test bye/

Which does not match the data (which contains literal parens).

To match the literal parens we could escape the parens, eg:

$3~/this is a test \(bye\)/

But as the OP has discovered it's not so easy to escape those parens when dealing with a (bash) variable containing a string (ie, text="this is a test $bye$").

Another option would be to bracket the parens, eg:

$3~/this is a test [(]bye[)]/

Which can be encompassed in a variable, ie, the following does work:

text="this is a test [(]bye[)]"
awk -F '","' -v text="$text" '$3~text {print $4}' test.csv

The next (bigger) problem then becomes one of how to reformat (bash) variables with the necessary pairs of brackets; keep in mind that there are other characters that also have special meaning within a regex (eg, ., *, [ and ]).

At this point it starts getting really messy when trying to figure out which characters need to be 'escaped' inside of (bash) variables.

An easier approach would be to look at a different comparison method that deals with strings (instead of regexes). As mentioned in a comment this is where the index() function comes in handy.

The index() function's 2nd argument is processed as a string (and not a regex) so there's no need to worry about some characters (eg, ( and )) being treated differently/specially. index() will return a 0 if the 2nd argument is not found, otherwise an integer is returned that indicates the location of the 2nd argument. [NOTE: awk treats a 0 as false and any other number as true]

This means we can keep our original (bash) variable assignment and instead make a small change to the awk script:

text="this is a test (bye)"                                        # no change
awk -F '","' -v text="$text" 'index($3,text) {print $4}' test.csv
                              ^^^^^^^^^^^^^^                       # replaces '$3~text'

This returns:

Alright"

NOTE: see GNU awk - String Functions for more details on various string functions; pay attention to which arguments are treated as strings vs. regexes

So what about that 2nd piece of code?

text="(hello)"
awk -F '","' -v text="$text" '$2~text {print $4}' test.csv it returns Alright"

awk treats this like:

$2~/(hello)/`

Which is really:

$2~/hello/`

Net result is that this evaluates as true because it matches on the literal string hello and (basically) ignores the literal parens in the data.

NOTE: text="(hello)" / $1~text would also evaluate as true in this case.

score 2 · Answer 2 · edited May 24 '23 at 12:37

With -F '","' your first field value is "hello, not hello or "hello". That, I think, is what's causing you to think you need to do regexp instead of string comparisons but as you're discovering that's the wrong solution. Using -F "," not only leads to your current problems but it's fragile since it'll fail given input like "head","foo"",""bar","tail" where the middle field "foo"",""bar" contains nested escaped quotes with a comma between them.

Using GNU awk for FPAT:

$ awk -v FPAT='([^,]*)|("([^"]|"")*")' -v text='hello' '
    $1 == ("\"" text "\"") { print $4 }
' test.csv
"Alright"

$ awk -v FPAT='([^,]*)|("([^"]|"")*")' -v text='this is a test (bye)' '
    $3 == ("\"" text "\"") { print $4 }
' test.csv
"Alright"

You can do that with any awk but you need to write a bit more code:

$ awk -v fpat='([^,]*)|("([^"]|"")*")' -v OFS=',' -v text='this is a test (bye)' '
    {
        tail = $0
        $0 = ""
        while ( (tail != "") && match(tail,fpat) ) {
            $(NF+1) = substr(tail,1,RLENGTH)
            tail = substr(tail,RLENGTH+2)
        }
    }
    $3 == ("\"" text "\"") { print $4 }
' test.csv
"Alright"

See What's the most robust way to efficiently parse CSV using awk? for more information on reading CSV with awk.

If your text string can contain backslashes then see How do I use shell variables in an awk script? for other ways than -v (which interprets escape sequences) to assign your awk variable a value outside of the script.

awk not finding a string value with special characters in a csv file

2 Answers2