I have an XML document containing a number of XML Processing Instructions which are of the form:

<?cpdoc something?>

I am trying to match them in awk with the pattern

/^\<\?cpdoc/

but it's not returning anything. If I remove the ^ anchor, it works, but I also have similar PIs that don't start a line, and I don't want those matched.

It looks as if it's being confused by the \<\? but why is it ignoring the line-start anchor?

Cœur
Peter Flynn
  • Yeah you can't just go escaping random characters and hoping for the best, you have to know which characters are metacharacters and then escape them if you want them treated as literal, otherwise you can turn literal characters INTO metacharacters by escaping them (as you just discovered `<` is literal while `\<` is a word boundary). If you're not sure then put them in a bracket expression instead of escaping them, e.g. `[<]` is still just a literal `<`. – Ed Morton Mar 11 '18 at 15:25
  • Peter Flynn, I moved your solution to a community answer of its own, adding a piece of @EdMorton comment. Feel free to improve it. – Cœur Apr 30 '18 at 11:51
  • Thank you! Having trouble finding it, though. Oh wait, no, it's in my mailbox. Duuh. – Peter Flynn May 01 '18 at 21:41

2 Answers

1

Don't parse XML with regex, use a proper XML/HTML parser.

theory :

According to formal language theory, XML can't be parsed with regular expressions, which are based on finite-state machines. Because XML is hierarchically nested, you need a pushdown automaton, for example an LALR grammar built with a tool like YACC.

realLife©®™ everyday tool in a shell :

You can use one of the following :

xmllint

xmlstarlet

saxon-lint (my own project)


Check: Using regular expressions with HTML tags


Example using xmllint :

xmllint --xpath '//processing-instruction()' file.xml
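
To narrow the result to a single PI target, the XPath node test accepts a name argument. The following is a minimal sketch, assuming xmllint (from libxml2) is installed; the sample document and the temporary file are hypothetical and only there to make the example self-contained. Note that XPath selects by target name, not by position on a line, so it does not distinguish PIs that start a line from ones that don't.

```shell
# Hypothetical sample document, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
<?xml version="1.0"?>
<doc>
<?cpdoc something?>
<p>text <?other thing?></p>
</doc>
EOF

# Select only the PIs whose target is "cpdoc".
# Skipped silently when xmllint is not available on this machine.
if command -v xmllint >/dev/null 2>&1; then
  pis=$(xmllint --xpath '//processing-instruction("cpdoc")' "$tmp")
  printf '%s\n' "$pis"
fi

rm -f "$tmp"
```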
Gilles Quénot
0

Solution by OP and explanation by Ed Morton.

It works if the less-than sign is not escaped: on its own `<` is a literal character, whereas the escaped `\<` is a word-boundary operator. So instead of:

\<\?

I should use a literal less-than and escape only the question mark:

<\?

This is because we can't just go escaping any character and hoping for the best, we have to know which characters are metacharacters and then escape them if we want them treated as literal.
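
As a quick check of the fix, here is a minimal sketch that runs the anchored pattern against a small sample. The sample lines are hypothetical. It also shows the bracket-expression form suggested in the comments, which avoids backslashes altogether.

```shell
# Hypothetical sample input: one PI at line start, one mid-line, one other PI.
sample='<?cpdoc something?>
<p>text <?cpdoc inline?></p>
<?other thing?>'

# Unescaped "<" is a literal; only "?" needs a backslash.
matched_escaped=$(printf '%s\n' "$sample" | awk '/^<\?cpdoc/')

# Bracket expressions make both characters literal with no backslashes.
matched_bracket=$(printf '%s\n' "$sample" | awk '/^[<][?]cpdoc/')

printf '%s\n' "$matched_escaped"
printf '%s\n' "$matched_bracket"
```

Both patterns should print only the line that begins with the cpdoc PI; the mid-line PI is excluded by the ^ anchor.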

Cœur