I am trying to extract all lines where a field matches a pattern that is held in a variable. I tried the following:

head input.dat |
awk -F '|' -v CODE="39905|19043" '{print $13; if($13~CODE){print "Matched"} else {print "Nomatch"} }'

I print the value of the field before attempting the pattern match, so that I don't have to show the entire line, which contains many fields. This is the output I got:

PLAN_ID
Nomatch
39905
Nomatch
39905
Nomatch
39883
Nomatch
19043
Nomatch
2215
Nomatch
19043
Nomatch
9149
Nomatch
42718
Nomatch
24
Nomatch

I expected to see at least 3 instances of Matched in the output. What am I doing wrong?


edit by @Fravadona

xxd input.dat | head -n 6
00000000: fffe 4d00 4f00 4e00 5400 4800 5f00 4900 ..M.O.N.T.H._.I.
00000010: 4400 7c00 5300 5600 4300 5f00 4400 5400 D.|.S.V.C._.D.T.
00000020: 7c00 5000 4100 5400 4900 4500 4e00 5400 |.P.A.T.I.E.N.T.
00000030: 5f00 4900 4400 7c00 5000 4100 5400 5f00 _.I.D.|.P.A.T._.
00000040: 5a00 4900 5000 3300 7c00 4300 4c00 4100 Z.I.P.3.|.C.L.A.
00000050: 4900 4d00 5f00 4900 4400 7c00 5300 5600 I.M._.I.D.|.S.V.

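It turns out that the input file uses the UTF-16LE encoding: the hexdump starts with the byte order mark `ff fe` and shows a NUL byte after every ASCII character, which is why the plain ASCII pattern never matched the field. The solution seems to be to convert the input file from UTF-16LE to UTF-8 before running AWK. Thanks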
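The check that the hexdump made possible can be sketched as a small script. This is only an illustration under stated assumptions: the sample file is hypothetical and built on the spot, and the test only recognises files that start with a byte order mark (a UTF-16 file without a BOM would not be detected this way).

```shell
# Hypothetical two-line sample, written as UTF-16LE with a leading BOM (FF FE).
sample=$(mktemp)
printf '\377\376' > "$sample"
printf 'PLAN_ID\n39905\n' | iconv -f UTF-8 -t UTF-16LE >> "$sample"

# Read the first two bytes as hex and strip the spacing that od adds.
bom=$(od -An -tx1 -N 2 "$sample" | tr -d ' \n')

case "$bom" in
  fffe) echo "UTF-16LE BOM found" ;;
  feff) echo "UTF-16BE BOM found" ;;
  *)    echo "no UTF-16 BOM" ;;
esac

rm -f "$sample"
```

Running the same `od` command against the real `input.dat` would be a quick first test before reaching for AWK.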

user5336
  • 2
    The code works for me (substituting `$13` by `$1`) on the sample provided (4 matches). gawk, mawk, busybox, original-awk – jhnc Jan 27 '23 at 03:15
  • `cat -vet input.dat | head -10` If you see `^M$` at the end of each line, use `dos2unix input.dat` . A quick test, and very often the source of mysterious problems on *nix. Good luck. – shellter Jan 27 '23 at 04:13
  • @shellter Why would CR affect matching here? Also, note that all ten lines of head output have a corresponding message in the sample output provided. – jhnc Jan 27 '23 at 05:42
  • @jhnc CRLF won't matter but the only explanation that I can think of would be the presence of a control char in the field, for ex. `399X\b05` – Fravadona Jan 27 '23 at 07:51
  • @user5336 Can you provide the output of `awk '{print $13}' | xxd`? – Fravadona Jan 27 '23 at 07:52
  • *using the original command* (delete un-needed chars), also check echo "39905|19043" | xxd – Andrew Jan 27 '23 at 11:53
  • @Fravadona, Here's the output: PLAN_ID 39905 39905 39883 19043 2215 19043 9149 42718 24 – user5336 Jan 27 '23 at 14:08
  • @user5336 you didn't pipe it to `xxd` – Fravadona Jan 27 '23 at 14:59
  • 1
    @user5336 and please edit the output of `xxd` into the body of your question. Good luck. – shellter Jan 27 '23 at 15:54
  • [edit] your question to replace the word "pattern" with string-or-regexp, full-or-partial, and line-or-word, whichever you meant. See [how-do-i-find-the-text-that-matches-a-pattern](https://stackoverflow.com/questions/65621325/how-do-i-find-the-text-that-matches-a-pattern). – Ed Morton Jan 27 '23 at 16:02
  • 1
    at this point my preference would be to get an actual copy of the contents of `input.dat`; please update the question with the complete output from either of the following: `head -2 input.dat | xxd` or `head -2 input.dat | base64`; either of these give us the ability to recreate an exact copy of the 1st 2 lines of the file in our systems – markp-fuso Jan 27 '23 at 16:03
  • Sorry @Fravadona. Here is hex dump of the first 80 characters. I suspect the file is in Unicode as opposed to ASCII. 00000000: fffe 4d00 4f00 4e00 5400 4800 5f00 4900 ..M.O.N.T.H._.I. 00000010: 4400 7c00 5300 5600 4300 5f00 4400 5400 D.|.S.V.C._.D.T. 00000020: 7c00 5000 4100 5400 4900 4500 4e00 5400 |.P.A.T.I.E.N.T. 00000030: 5f00 4900 4400 7c00 5000 4100 5400 5f00 _.I.D.|.P.A.T._. 00000040: 5a00 4900 5000 3300 7c00 4300 4c00 4100 Z.I.P.3.|.C.L.A. 00000050: 4900 4d00 5f00 4900 4400 7c00 5300 5600 I.M._.I.D.|.S.V. – user5336 Jan 27 '23 at 16:18
  • @user5336 WOW, you got a `NUL` byte between each letter; if you use `tr -d '\0' < input.dat | awk ...` then it might work – Fravadona Jan 27 '23 at 16:23
  • 1
    Please stop posting information in comments where it can't be formatted and could be missed - [edit] your question to include all relevant information. – Ed Morton Jan 27 '23 at 16:25
  • @Fravadona Yes, I didn't realize that the input file is in UTF-16LE. Hence the additional NUL byte for each character. I think the utility iconv allows you to convert between character-sets. I'll use that & see. Thanks – user5336 Jan 27 '23 at 16:28

1 Answer

I found out (thanks to all who suggested looking at the hexdump of the input file) that the file used UTF-16LE encoding. Once I converted the input file to UTF-8 using `iconv`, the AWK script worked as expected.
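A minimal sketch of that conversion follows. The exact command used on the real file is not shown here, so this is an assumption: the sample data is hypothetical, it has only two fields, and the match is done on `$1` rather than `$13`. `iconv -f UTF-16` is used so that the BOM decides the byte order and is dropped from the output.

```shell
# Hypothetical stand-in for input.dat: UTF-16LE with a BOM, '|' separated.
input=$(mktemp)
printf '\377\376' > "$input"
printf 'PLAN_ID|NAME\n39905|a\n39883|b\n19043|c\n' |
  iconv -f UTF-8 -t UTF-16LE >> "$input"

# Convert to UTF-8 first, then run the same kind of match as in the question.
result=$(iconv -f UTF-16 -t UTF-8 "$input" |
  awk -F '|' -v CODE='39905|19043' \
    '{ print $1 ": " (($1 ~ CODE) ? "Matched" : "Nomatch") }')

printf '%s\n' "$result"
rm -f "$input"
```

Compared with `tr -d '\0'`, which only removes the NUL bytes and leaves the BOM in front of the first field, converting with `iconv` also keeps any non-ASCII characters intact.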

user5336