
I have a confusing (to me) example here. This awk expression gives the desired result and prints: "match"

$ echo -e "<?xml version="1.1" encoding="UTF-8" standalone="no"?>\n<databaseChangeLog" |
  awk  -e'/[[:space:]]*<?xml /{ print "match"; } { quit 0; }'
match
$ 

We actually want the pattern to match only at the start of the line. As far as I know, that calls for the beginning-of-line anchor, ^. And yet adding ^ fails, as shown:

$ echo -e "<?xml version="1.1" encoding="UTF-8" standalone="no"?>\n<databaseChangeLog" |
  awk  -e'/^[[:space:]]*<?xml /{ print "match"; } { quit 0; }'
$ 
$ # NO match

Using gawk, version:

$ awk --version
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
Copyright (C) 1989, 1991-2019 Free Software Foundation.
  ... 

What's missing?

will
  • First, you don't need the `-e` option to run an `awk` program. I tested both of your `awk` programs (with and without `-e`) and I get a match from both. I am using gawk 5.0.1; which awk version are you using? I don't think the version is the issue, but if I have it I can try to replicate the problem. – RavinderSingh13 Oct 21 '21 at 04:38
  • [Don't Parse XML/HTML With Regex.](https://stackoverflow.com/a/1732454/3776858) I suggest to use an XML/HTML parser (xmlstarlet, xmllint ...). – Cyrus Oct 21 '21 at 04:50
  • Add `^` to your regular expression, not to the input string. To match 0 or more spaces followed by ` – Renaud Pacalet Oct 21 '21 at 04:51
  • @Cyrus I do agree that regex is the wrong tool for parsing XML, but if the intention is only detection (i.e. does it look like an XML file?) then regex might be enough for that – Daweo Oct 21 '21 at 08:07
  • @RenaudPacalet ... Thank you, yes, a confusing typo. Sorry, folks. – will Oct 25 '21 at 14:01
  • @Cyrus ... Yes, not parsing XML, just want to know why this example fails. – will Oct 25 '21 at 14:02

1 Answer

You added the ^ to your input instead of adding it to the regexp in your code that's supposed to match the input, i.e. you did:

$ echo '^foobar' | awk '/bar/'
^foobar

Instead of:

$ echo 'foobar' | awk '/^bar/'
$

You're also using ? as a regexp metacharacter when you want a literal ?, and you're using a non-existent keyword, quit, where I assume you mean exit. What your code actually does is concatenate an undefined variable named quit with the number 0, producing the string 0, which is then discarded. Since exiting with status 0 is the default anyway, that whole block is redundant.
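To see the effect of the unescaped ? in isolation, here is a small sketch that runs the same input line through three variants of the regexp. The variable names are just for illustration. Unescaped, `<?` means "an optional <", so the unanchored regexp finds "xml " in the middle of the line, while the anchored one would need the line to begin with "xml " or "<xml ":

```shell
line='<?xml version="1.1" encoding="UTF-8" standalone="no"?>'

# Unanchored, ? as a metachar: matches the "xml " that follows "<?".
unanchored=$(printf '%s\n' "$line" | awk '/[[:space:]]*<?xml /{ print "match" }')

# Anchored, ? as a metachar: the line starts with "<?", not "xml " or "<xml ", so no match.
anchored=$(printf '%s\n' "$line" | awk '/^[[:space:]]*<?xml /{ print "match" }')

# Anchored, literal ?: matches "<?xml " at the start of the line.
escaped=$(printf '%s\n' "$line" | awk '/^[[:space:]]*<\?xml /{ print "match" }')

printf 'unanchored=%s anchored=%s escaped=%s\n' "$unanchored" "$anchored" "$escaped"
```

Only the middle variant comes back empty, which is the failure described in the question.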

I think this might be what you're trying to do:

awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ if (f) print "match"; exit !f }'

e.g.:

$ printf '%s\n' '<?xml version="1.1" encoding="UTF-8" standalone="no"?>' '<databaseChangeLog' |
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ if (f) print "match"; exit !f }'
match
$ echo $?
0

$ printf '%s\n' 'foo<?xml version="1.1" encoding="UTF-8" standalone="no"?>' '<databaseChangeLog' |
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ if (f) print "match"; exit !f }'
$ echo $?
1

The above will work in any POSIX awk. If you have a very old awk that doesn't support POSIX character classes then just change [[:space:]] to [ \t] and that will work in any awk.
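As a rough sketch of that fallback, the same script with [ \t] in place of [[:space:]] might look like the following. Note that [ \t] only covers blanks and tabs, whereas [[:space:]] also covers other whitespace such as carriage returns and form feeds; the sample input and the variable names are made up for illustration:

```shell
# Input line starts with a blank and a tab before the XML declaration.
result=$(printf ' \t<?xml version="1.1"?>\n' |
    awk '/^[ \t]*<\?xml /{ f=1; exit } END{ if (f) print "match"; exit !f }')
status=$?

printf 'result=%s status=%s\n' "$result" "$status"
```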

Consider also printing match or no match to stderr:

$ printf '%s\n' '<?xml version="1.1" encoding="UTF-8" standalone="no"?>' '<databaseChangeLog' |
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ print (f ? "" : "no ") "match" | "cat>&2"; exit !f }'
match

$ printf '%s\n' 'foo<?xml version="1.1" encoding="UTF-8" standalone="no"?>' '<databaseChangeLog' |
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ print (f ? "" : "no ") "match" | "cat>&2"; exit !f }'
no match
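
Since every variant above sets the exit status, a calling script can branch on it without capturing any output. A minimal sketch, assuming a hypothetical wrapper function named has_xml_decl (not part of the original code) that prints nothing and only reports through its exit status:

```shell
# Hypothetical helper: succeeds if any line on stdin begins with
# optional whitespace followed by "<?xml ".
has_xml_decl() {
    awk '/^[[:space:]]*<\?xml /{ f=1; exit } END{ exit !f }'
}

if printf '%s\n' '<?xml version="1.1"?>' '<databaseChangeLog' | has_xml_decl; then
    echo "looks like XML"
else
    echo "not XML"
fi
```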
Ed Morton
  • Unfortunately the caret (`^`) was misplaced in the command. I ought to remove the question. It is probably a bit confusing now. – will Oct 25 '21 at 14:05