Simple Pattern match with a field and a variable does not seem to work in GAWK/AWK

Question

I am trying to extract all lines where a field matches a pattern that is defined in a variable. I tried the following:

head input.dat |
awk -F '|' -v CODE="39905|19043" '{print $13; if($13~CODE){print "Matched"} else {print "Nomatch"} }'

I print the value of the field before attempting the pattern match (this way I don't have to show the entire line, which contains many fields). This is the output I got:

PLAN_ID
Nomatch
39905
Nomatch
39905
Nomatch
39883
Nomatch
19043
Nomatch
2215
Nomatch
19043
Nomatch
9149
Nomatch
42718
Nomatch
24
Nomatch

I expected to see at least 3 instances of Matched in the output. What am I doing wrong?

(edit by @Fravadona)

xxd input.dat | head -n 6

00000000: fffe 4d00 4f00 4e00 5400 4800 5f00 4900 ..M.O.N.T.H._.I.
00000010: 4400 7c00 5300 5600 4300 5f00 4400 5400 D.|.S.V.C._.D.T.
00000020: 7c00 5000 4100 5400 4900 4500 4e00 5400 |.P.A.T.I.E.N.T.
00000030: 5f00 4900 4400 7c00 5000 4100 5400 5f00 _.I.D.|.P.A.T._.
00000040: 5a00 4900 5000 3300 7c00 4300 4c00 4100 Z.I.P.3.|.C.L.A.
00000050: 4900 4d00 5f00 4900 4400 7c00 5300 5600 I.M._.I.D.|.S.V.

It turns out that the input file uses the UTF-16LE encoding (as shown by the hexdump above), so every ASCII character is followed by a NUL byte and the field never matches the regexp. The solution is therefore to convert the input file from UTF-16LE to UTF-8 before running AWK. Thanks.
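The hexdump above begins with the bytes `ff fe`, which is the UTF-16LE byte order mark. As a rough sketch, the same check can be scripted instead of read off `xxd` by eye. The function name `has_utf16_bom` is made up for illustration, and it assumes a `head` that supports `-c`:

```shell
# Hypothetical helper: report whether a file starts with a UTF-16 byte order mark.
has_utf16_bom() {
    # Read the first two bytes and render them as hex without spaces (e.g. "fffe").
    bom=$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')
    case $bom in
        fffe) echo "UTF-16LE" ;;
        feff) echo "UTF-16BE" ;;
        *)    echo "no UTF-16 BOM" ;;
    esac
}
```

Calling `has_utf16_bom input.dat` on a file like the one in the question would be expected to report `UTF-16LE`. A file without a BOM can still be UTF-16, so a negative result is only a hint.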

The code works for me (substituting `$13` by `$1`) on the sample provided (4 matches). gawk, mawk, busybox, original-awk — jhnc, Jan 27 '23 at 03:15
`cat -vet input.dat | head -10` If you see `^M$` at the end of each line, use `dos2unix input.dat` . A quick test, and very often the source of mysterious problems on *nix. Good luck. — shellter, Jan 27 '23 at 04:13
@shellter Why would CR affect matching here? Also, note that all ten lines of head output have a corresponding message in the sample output provided. — jhnc, Jan 27 '23 at 05:42
@jhnc CRLF won't matter but the only explanation that I can think of would be the presence of a control char in the field, for ex. `399X\b05` — Fravadona, Jan 27 '23 at 07:51
@user5336 Can you provide the output of `awk '{print $13}' | xxd`? — Fravadona, Jan 27 '23 at 07:52
*using the original command* (delete un-needed chars), also check echo "39905|19043" | xxd — Andrew, Jan 27 '23 at 11:53
@Fravadona, Here's the output: PLAN_ID 39905 39905 39883 19043 2215 19043 9149 42718 24 — user5336, Jan 27 '23 at 14:08
@user5336 and please edit the output of `xxd` into the body of your question. Good luck. — shellter, Jan 27 '23 at 15:54
[edit] your question to replace the word "pattern" with string-or-regexp, full-or-partial, and line-or-word, whichever you meant. See [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern). — Ed Morton, Jan 27 '23 at 16:02
at this point my preference would be to get an actual copy of the contents of `input.dat`; please update the question with the complete output from either of the following: `head -2 input.dat | xxd` or `head -2 input.dat | base64`; either of these give us the ability to recreate an exact copy of the 1st 2 lines of the file in our systems — markp-fuso, Jan 27 '23 at 16:03
Sorry @Fravadona. Here is hex dump of the first 80 characters. I suspect the file is in Unicode as opposed to ASCII. 00000000: fffe 4d00 4f00 4e00 5400 4800 5f00 4900 ..M.O.N.T.H._.I. 00000010: 4400 7c00 5300 5600 4300 5f00 4400 5400 D.|.S.V.C._.D.T. 00000020: 7c00 5000 4100 5400 4900 4500 4e00 5400 |.P.A.T.I.E.N.T. 00000030: 5f00 4900 4400 7c00 5000 4100 5400 5f00 _.I.D.|.P.A.T._. 00000040: 5a00 4900 5000 3300 7c00 4300 4c00 4100 Z.I.P.3.|.C.L.A. 00000050: 4900 4d00 5f00 4900 4400 7c00 5300 5600 I.M._.I.D.|.S.V. — user5336, Jan 27 '23 at 16:18
@user5336 WOW, you got a `NUL` byte between each letter; if you use `tr -d '\0' < input.dat | awk ...` then it might work — Fravadona, Jan 27 '23 at 16:23
Please stop posting information in comments where it can't be formatted and could be missed - [edit] your question to include all relevant information. — Ed Morton, Jan 27 '23 at 16:25
@Fravadona Yes, I didn't realize that the input file is in UTF-16LE. Hence the additional NUL byte for each character. I think the utility iconv allows you to convert between character-sets. I'll use that & see. Thanks — user5336, Jan 27 '23 at 16:28

score 1 · Answer 1 · answered Jan 29 '23 at 21:16

I found out (thanks to everyone who suggested looking at the hexdump of the input file) that the file uses UTF-16LE encoding. Once I converted the input file with iconv, the AWK script worked as expected.
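For illustration, here is a minimal sketch of that conversion. It assumes an `iconv` that accepts the `UTF-16LE` and `UTF-8` encoding names. The sample file, its two-field layout, and the function name `match_codes` are invented for the example; the real file keeps the code in field 13.

```shell
# Hypothetical sample: build a small UTF-16LE file from UTF-8 text.
# Two '|'-separated fields here; the real input.dat has the code in field 13.
printf 'ID|PLAN_ID\n1|39905\n2|39883\n3|19043\n' |
    iconv -f UTF-8 -t UTF-16LE > sample.dat

# Convert back to UTF-8 on the fly, then apply the same test as in the question.
match_codes() {
    iconv -f UTF-16LE -t UTF-8 "$1" |
        awk -F '|' -v CODE="39905|19043" '{ if ($2 ~ CODE) print "Matched"; else print "Nomatch" }'
}

match_codes sample.dat
```

On this sample the header and the `39883` row print `Nomatch` and the other two rows print `Matched`. The real file starts with a byte order mark, which `-f UTF-16LE` would likely carry through as a character at the start of the first field; using `-f UTF-16` instead should let iconv consume it, though that is worth verifying on your platform.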

user5336
