Using awk to find special characters in a txt file

Question

I need to scan a file that contains many values, each followed by a sequence of special characters. Given a sequence of special characters, I need to print the value next to it:

547 %$ 
236 \"
4523 &* 
8876 (*
8756 "/
...

I am using an awk command with gsub in order to find the sequences as they are.

awk -v st="$match_string" 'BEGIN {gsub(/(\[|\]|\-|\$|\*|\:|\+|\"|\(|\))/,"\\\\&", st)} match($0,st) {print;exit}' file.txt

The command works great e.g.

> (*
>> 8876 (*

However, I am having trouble using the command to locate the \" sequence. I have tried adding different strings to the gsub to represent the sequence:

|\\|
|\\\\|
|\\\\"|
...

But the result is always:

> \"
>> 8756 "/

while the result I am looking for would be:

> \"
>> 236 \"

It seems that the gsub does not work, and the \" is interpreted as just ". Any ideas?

Following is a short script to run. It should find the symbol attached to the value in first_num, then print the first line in the file that has that symbol:

first_num=$1
echo "looking for : $first_num"
sym_to_check=$(awk -v s="$first_num"  '$0~s {if ($0~s)print $2}' temp.txt)
echo "symbol - $sym_to_check"
first_val=$(awk -v s="$sym_to_check" 'BEGIN {gsub(/(\[|\]|\-|\$|\^|\*|\:|\+|\"|\(|\))/,"\\\\&",s)} $0~s {if ($0~s)print; if ($0~s)exit}' temp.txt)
echo "first val- $first_val"

suppose the txt file is:

547 %$ 
111 [*
222 ()
5655 (*
454 )"
35 #!
743 \"
657 #!
236 \"
4523 &* 
8876 (*
456 \"
8756 "/

The first run is good:

> bash temp1.sh 8876
looking for : 8876
symbol - (*
first val- 5655 (*

The script finds the first value attached to (*, but the next run is bad:

> bash temp1.sh 236
looking for : 236
symbol - \"
first val- 454 )"

The symbol is correct (it is looking for \"), but when searching for the first value attached to it, the script matches the first symbol containing " instead. This gives 454 )" rather than the desired 743 \"

What inputs are working and what is not? Provide the inputs you are testing on and expected output for that — Inian, Apr 27 '20 at 10:05
sequences such as \" are not working - they are translated into " — tom, Apr 27 '20 at 10:20
Okay we got it, provide an example that we can copy-paste easily and work on — oguz ismail, Apr 27 '20 at 10:22

Ed Morton · Answer 1 · 2020-04-27T12:06:56.693

The way you're initializing the awk variable st, using -v st="$match_string", expands escape sequences by design (so \t in "$match_string" would become a literal tab char in st, for example). You're also using a regexp operator, match(), while trying to escape the regexp metachars to make it act like string matching instead of regexp matching. Finally, you're doing partial matching on the whole line (e.g. $0~85 would match 1853) instead of full matching on a specific field ($1==85).
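
A small sketch of the first and last points may help. The variable names (via_v, via_env, partial, exact) are made up for illustration; the sketch only compares string lengths and match counts, so it should behave the same in any awk that supports ENVIRON:

```shell
# The two-character string from the question: a backslash followed by a double quote.
match_string='\"'

# -v runs the value through awk's escape-sequence processing, so \" collapses to ".
via_v=$(awk -v st="$match_string" 'BEGIN { print length(st) }')

# ENVIRON hands the value over untouched, so both characters survive.
via_env=$(st="$match_string" awk 'BEGIN { print length(ENVIRON["st"]) }')

printf 'length via -v:      %s\n' "$via_v"
printf 'length via ENVIRON: %s\n' "$via_env"

# Partial regexp match on the whole line versus a full match on one field.
partial=$(printf '1853 x\n85 y\n' | awk '$0 ~ 85 { n++ } END { print n + 0 }')
exact=$(printf '1853 x\n85 y\n' | awk '$1 == 85 { n++ } END { print n + 0 }')

printf 'lines matched by $0 ~ 85:  %s\n' "$partial"
printf 'lines matched by $1 == 85: %s\n' "$exact"
```

The -v variable ends up one character long while the ENVIRON copy keeps both characters, and the regexp test matches both input lines where the field comparison matches only one.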

Here's how you init awk variables from the shell without interpreting escape sequences and then test for them as full-matching literal strings or numbers on a specific field rather than partial-matching regexps across the whole line:

$ match_string='\"'

$ st="$match_string" awk 'BEGIN{st=ENVIRON["st"]} $2==st{print; exit}' file
743 \"

$ awk 'BEGIN{st=ARGV[1]; ARGV[1]=""} $2==st{print; exit}' "$match_string" file
743 \"

$ awk 'BEGIN{st=ARGV[1]; ARGV[1]=""} $1==st{print; exit}' '743' file
743 \"

Not all awks support ENVIRON[], so the first approach won't work in all awks, but the second will.
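
Applied to the script in the question, the ENVIRON approach could look roughly like the sketch below. The helper names symbol_for and first_line_for are hypothetical, and the sample data is copied from the question into a temporary file so the snippet runs on its own:

```shell
# Sample data from the question, written to a temporary file.
data=$(mktemp)
cat > "$data" <<'EOF'
547 %$
111 [*
222 ()
5655 (*
454 )"
35 #!
743 \"
657 #!
236 \"
4523 &*
8876 (*
456 \"
8756 "/
EOF

# Hypothetical helper: print the symbol (field 2) on the line whose number (field 1) is $1.
symbol_for() {
    num="$1" awk 'BEGIN { num = ENVIRON["num"] } $1 == num { print $2; exit }' "$2"
}

# Hypothetical helper: print the first line whose symbol (field 2) is exactly $1.
first_line_for() {
    sym="$1" awk 'BEGIN { sym = ENVIRON["sym"] } $2 == sym { print; exit }' "$2"
}

sym=$(symbol_for 236 "$data")
first=$(first_line_for "$sym" "$data")

# printf '%s\n' avoids any backslash interpretation that echo might apply.
printf 'symbol - %s\n' "$sym"
printf 'first val - %s\n' "$first"

rm -f "$data"
```

Because both helpers compare whole fields with ==, no regexp escaping is needed, and looking up 236 yields the 743 line rather than 454 )".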

See How do I use shell variables in an awk script? for how to set awk variables from shell. When you want to do literal string comparisons, it's usually simpler to just use string operators like == and index() instead of using regexp operators like ~ or match() and trying to escape all the regexp metacharacters to make them act like they're strings.
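
If a literal substring match anywhere on the line is what you want, rather than a full-field comparison, index() can do it without any escaping. This is only a sketch: the function name first_containing is made up, and the input lines are a few of the sample lines from the question:

```shell
# Hypothetical helper: print the first input line containing the literal string $1.
first_containing() {
    needle="$1" awk 'BEGIN { n = ENVIRON["needle"] } index($0, n) { print; exit }'
}

# printf '%s\n' passes the backslashes and quotes through unchanged.
hit_paren=$(printf '%s\n' '222 ()' '5655 (*' '8876 (*' | first_containing '(*')
hit_quote=$(printf '%s\n' '454 )"' '743 \"' '236 \"' | first_containing '\"')

printf '%s\n' "$hit_paren"
printf '%s\n' "$hit_quote"
```

index() returns the position of the substring, or 0 when it is absent, so it works directly as a pattern.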

If you ever DID want to escape all regexp metachars, though, then the syntax to do that would be:

gsub(/[^^]/,"[&]",st); gsub(/\^/,"\\^",st)

rather than what you have in the code in your question:

gsub(/(\[|\]|\-|\$|\*|\:|\+|\"|\(|\))/,"\\\\&", st)

See Is it possible to escape regex metacharacters reliably with sed for an explanation of why that is the correct syntax.
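
As a quick illustration of what that escaping produces, here is a sketch that applies the two gsub() calls to the string (* and then uses the result as a regexp. The variable names are made up, and the sketch only exercises a string with no backslashes or carets; other inputs are not covered here:

```shell
pattern='(*'

# Wrap every character except ^ in a bracket expression, then backslash-escape any ^.
escaped=$(p="$pattern" awk 'BEGIN {
    st = ENVIRON["p"]
    gsub(/[^^]/, "[&]", st)
    gsub(/\^/, "\\^", st)
    print st
}')
printf 'escaped regexp: %s\n' "$escaped"

# Use the escaped string as a dynamic regexp; each bracket expression matches one literal char.
found=$(printf '%s\n' '222 ()' '5655 (*' | re="$escaped" awk 'BEGIN { re = ENVIRON["re"] } $0 ~ re { print; exit }')
printf 'first match: %s\n' "$found"
```

Each character sits in its own bracket expression, so ( and * lose their special meaning without any backslashes being added.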

Thanks - the st="$match_string" awk 'BEGIN{st=ENVIRON["st"]} $2==st{print; exit}' file solution works great !! — tom, Apr 27 '20 at 13:56
