Focusing solely on the non-matching regex issue ....
The ~ operator says to process the right side of the operation as a regex. When the right side is a string (or a variable containing a string, as in this case), the string is converted to a regex (see GNU awk - Using Dynamic Regexps).
In this case:
text="this is a test (bye)"
awk -F '","' -v text="$text" '$3~text {print $4}' test.csv
The comparison ($3~text) is converted to:
$3~/this is a test (bye)/
Here the parens are treated as special regex characters (they delimit a group) and not as literal parens, so this is effectively the same as:
$3~/this is a test bye/
Which does not match the data (which contains literal parens).
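The effect is easy to reproduce on a single line of input. The following is a minimal sketch; the sample strings are made up for illustration and are not taken from the OP's test.csv:

```shell
text='this is a test (bye)'

# Input containing literal parens: the dynamic regex does NOT match it,
# because (bye) is a group that matches just "bye".
with_parens=$(printf '%s\n' 'this is a test (bye)' |
  awk -v text="$text" '{ print (($0 ~ text) ? "match" : "no match") }')

# Input without parens: the same regex DOES match.
without_parens=$(printf '%s\n' 'this is a test bye' |
  awk -v text="$text" '{ print (($0 ~ text) ? "match" : "no match") }')

echo "with parens:    $with_parens"
echo "without parens: $without_parens"
```

The first line should report "no match" and the second "match", which is the reverse of what the OP wanted.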
To match the literal parens we could escape the parens, eg:
$3~/this is a test \(bye\)/
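But as the OP has discovered, it's not so easy to escape those parens when dealing with a (bash) variable containing a string (ie, text="this is a test \(bye\)"). Part of the difficulty is that awk also processes backslash escape sequences in -v assignments, so the backslashes may not reach the regex intact.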
Another option would be to bracket the parens, eg:
$3~/this is a test [(]bye[)]/
Which can be encompassed in a variable, ie, the following does work:
text="this is a test [(]bye[)]"
awk -F '","' -v text="$text" '$3~text {print $4}' test.csv
The next (bigger) problem then becomes one of how to reformat (bash) variables with the necessary pairs of brackets; keep in mind that there are other characters that also have special meaning within a regex (eg, ., *, [ and ]).
At this point it starts getting really messy when trying to figure out which characters need to be 'escaped' inside of (bash) variables.
An easier approach would be to look at a different comparison method that deals with strings (instead of regexes). As mentioned in a comment, this is where the index() function comes in handy.
The index() function's 2nd argument is processed as a string (and not a regex) so there's no need to worry about some characters (eg, ( and )) being treated differently/specially. index() returns 0 if the 2nd argument is not found in the 1st; otherwise it returns an integer giving the position (counting from 1) at which the 2nd argument starts within the 1st. [NOTE: awk treats 0 as false and any other number as true]
This means we can keep our original (bash) variable assignment and instead make a small change to the awk script:
text="this is a test (bye)" # no change
awk -F '","' -v text="$text" 'index($3,text) {print $4}' test.csv
# 'index($3,text)' replaces '$3~text'
This returns:
Alright"
NOTE: see GNU awk - String Functions for more details on various string functions; pay attention to which arguments are treated as strings vs. regexes
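To see the return values of index() directly, here is a small sketch on a made-up input line (the sample line and the search strings are assumptions for illustration, not the OP's data):

```shell
line='this is a test (bye)'

# Found: index() returns the 1-based position where the target starts.
found=$(printf '%s\n' "$line" | awk -v text='(bye)' '{ print index($0, text) }')

# Not found: index() returns 0, which awk treats as false in a pattern.
missing=$(printf '%s\n' "$line" | awk -v text='(hi)' '{ print index($0, text) }')

echo "found at: $found"
echo "missing:  $missing"
```

Here the literal ( sits at position 16 of the line, so the first call prints 16 and the second prints 0. The parens need no escaping in either case.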
So what about that 2nd piece of code?
text="(hello)"
awk -F '","' -v text="$text" '$2~text {print $4}' test.csv
This returns:
Alright"
awk treats this like:
$2~/(hello)/
Which is really:
$2~/hello/
Net result is that this evaluates as true because it matches on the literal string hello and (basically) ignores the literal parens in the data.
NOTE: with text="(hello)", the test $1~text would also evaluate as true in this case.
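One consequence is that the regex version can succeed for the wrong reason. The sketch below uses a hypothetical row (not from the OP's test.csv) whose 2nd field contains hello but no parens at all, and compares the regex test with the index() test:

```shell
# Hypothetical row: field 2 contains "hello" but no parens.
row='"x","say hello","unused","Alright"'
text='(hello)'

# Regex test: (hello) is a group matching "hello", so this row is selected.
regex_out=$(printf '%s\n' "$row" |
  awk -F '","' -v text="$text" '$2 ~ text {print $4}')

# String test: the literal text "(hello)" is not in field 2, so nothing prints.
index_out=$(printf '%s\n' "$row" |
  awk -F '","' -v text="$text" 'index($2, text) {print $4}')

printf 'regex: [%s]\nindex: [%s]\n' "$regex_out" "$index_out"
```

The regex version prints Alright" for this row even though the parens are absent from the data, while the index() version prints nothing. The trailing " appears because the field separator is "," and so the closing quote of the last field is never consumed.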