How can I check for the first pattern on a line with awk?

Question

I have a confusing (to me) example here. This awk expression gives the desired result and prints: "match"

$ echo -e "<?xml version="1.1" encoding="UTF-8" standalone="no"?>\n<databaseChangeLog" |
  awk  -e'/[[:space:]]*<?xml /{ print "match"; } { quit 0; }'
match
$

We actually want the match to be the first thing on the line. As far as I know, that calls for the beginning-of-string/line anchor, ^. And yet adding ^ fails, as shown:

$ echo -e "<?xml version="1.1" encoding="UTF-8" standalone="no"?>\n<databaseChangeLog" |
  awk  -e'/^[[:space:]]*<?xml /{ print "match"; } { quit 0; }'
$ 
$ # NO match

Using gawk, version:

$ awk --version
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
Copyright (C) 1989, 1991-2019 Free Software Foundation.
  ...

What's missing?

First of all, you don't need the `-e` option to run an `awk` program. I tested both of your `awk` programs (with and without `-e`) and I get a match from both. I am using gawk 5.0.1. Which awk version are you using? I don't think the version is the issue, but if I have it I could try to replicate the problem. Cheers. — RavinderSingh13, Oct 21 '21 at 04:38
[Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest using an XML/HTML parser (xmlstarlet, xmllint ...). — Cyrus, Oct 21 '21 at 04:50
Add `^` to your regular expression, not to the input string. To match 0 or more spaces followed by ` — Renaud Pacalet, Oct 21 '21 at 04:51
@Cyrus I do agree that regex is the wrong tool for parsing XML, but if the intention is only detection (i.e. does it look like an XML file?) then regex might be enough for that — Daweo, Oct 21 '21 at 08:07
@RenaudPacalet ... Thank you, yes a confusing typo. Sorry folk. — will, Oct 25 '21 at 14:01
@Cyrus ... Yes, not parsing XML, just want to know why this example fails. — will, Oct 25 '21 at 14:02

Ed Morton · Accepted Answer · 2021-10-21T16:11:59.853

You added the ^ to your input instead of adding it to the regexp in your code that's supposed to match the input, i.e. you did:

$ echo '^foobar' | awk '/bar/'
^foobar

Instead of:

$ echo 'foobar' | awk '/^bar/'
$

You're also using ? as a regexp metachar when you want a literal ? instead. And you're trying to use a non-existent keyword, quit, when I assume you mean exit, so what your code actually does is concatenate an undefined variable with the number 0, resulting in the string 0, which you then just discard. Even with exit you would only exit with 0, which is the default anyway, so that's all redundant.
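
To see the effect of the unescaped ? in isolation, here is a small sketch run against the first input line from the question. The `line` variable and the printed labels are only for illustration. In an ERE, `<?` means "an optional `<`", so the unanchored regexp is satisfied by the `xml ` that occurs later in the line, while the anchored one fails because a `?` rather than `xml ` follows the `<` at the start of the line. The last command shows what `quit 0` evaluates to.

```shell
line='<?xml version="1.1" encoding="UTF-8" standalone="no"?>'

# Unescaped ?: "<?" is "zero or one <", so this matches the "xml " at offset 2.
printf '%s\n' "$line" | awk '/<?xml /{ print "unanchored, unescaped ?: match" }'

# Anchored, still unescaped: after the optional "<" the regexp needs "xml ",
# but the input has "?" there, so nothing is printed.
printf '%s\n' "$line" | awk '/^[[:space:]]*<?xml /{ print "anchored, unescaped ?: match" }'

# Anchored with a literal ?: matches at the start of the line.
printf '%s\n' "$line" | awk '/^[[:space:]]*<\?xml /{ print "anchored, escaped ?: match" }'

# "quit 0" concatenates the unset variable quit (an empty string) with 0.
awk 'BEGIN{ x = quit 0; print "[" x "]" }'
```

Only the first and third awk commands print their label, and the last command prints `[0]`.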

I think this might be what you're trying to do:

awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ if (f) print "match"; exit !f }'

e.g.:

$ printf '%s\n' '<?xml version="1.1" encoding="UTF-8" standalone="no"?>' '<databaseChangeLog' |
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ if (f) print "match"; exit !f }'
match
$ echo $?
0

$ printf '%s\n' 'foo<?xml version="1.1" encoding="UTF-8" standalone="no"?>' '<databaseChangeLog' |
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ if (f) print "match"; exit !f }'
$ echo $?
1

The above will work in any POSIX awk. If you have a very old awk that doesn't support POSIX character classes then just change [[:space:]] to [ \t] and that will work in any awk.
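
Because the exit status carries the result, the same awk program can drive a shell conditional directly. The following is a minimal sketch, assuming you only need the status and not the printed "match". The function name `has_xml_decl` and the echoed messages are hypothetical, not part of the answer above.

```shell
# has_xml_decl is a hypothetical helper name used only for this sketch.
# It returns 0 if any input line starts with optional whitespace followed
# by "<?xml ", and 1 otherwise. With no arguments it reads stdin.
has_xml_decl() {
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ exit !f }' "$@"
}

if printf '%s\n' '<?xml version="1.1"?>' '<databaseChangeLog' | has_xml_decl; then
    echo "looks like XML"
else
    echo "no XML declaration found"
fi
```

File names can be passed instead of piping input, since `"$@"` is handed straight to awk.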

Consider also printing match or no match to stderr:

$ printf '%s\n' '<?xml version="1.1" encoding="UTF-8" standalone="no"?>' '<databaseChangeLog' |
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ print (f ? "" : "no ") "match" | "cat>&2"; exit !f }'
match

$ printf '%s\n' 'foo<?xml version="1.1" encoding="UTF-8" standalone="no"?>' '<databaseChangeLog' |
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ print (f ? "" : "no ") "match" | "cat>&2"; exit !f }'
no match
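
The benefit of writing the message to stderr is that stdout stays free for data, and the message can be silenced or redirected on its own. A small sketch reusing the command above (the `check` function name and the echoed text are hypothetical, for illustration only):

```shell
# check is a hypothetical wrapper around the awk command shown above.
check() {
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ print (f ? "" : "no ") "match" | "cat>&2"; exit !f }'
}

# With stderr discarded nothing from awk is shown; only the status remains.
printf '%s\n' 'foo<?xml version="1.1"?>' | check 2>/dev/null || echo "status $?"

# Silence the message and branch on the exit status alone.
if printf '%s\n' '<?xml version="1.1"?>' | check 2>/dev/null; then
    echo "found"
fi
```

In gawk, `print ... > "/dev/stderr"` is an alternative way to reach stderr; the `| "cat>&2"` form does not rely on that file name being available.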

Unfortunately the caret (`^`) was misplaced in the command. I ought to remove the question. It is probably a bit confusing now. — will, Oct 25 '21 at 14:05
