How to extract xml attributes with (g)awk

Question

So I have this xml line example, which is being read from ,

<element attr1=”XX” attr2=”0818820\.x11” attr3=”YYXX.x11” attr-4=”1”/>

As it is xml, the order of the attributes is arbitrary, and some may be optional.

So with awk I tried to select one of them, say attr1 using gensub.

while (getline < "./file") {
    print $0
    #First attempt
    #print gensub(/.*attr1=\"(.*)\".*/,"\\1","g",$0)
    #Second attempt
    print gensub(/.*attr1="(.*)".*/,"\\1","g",$0)
}

However, I have not managed to get a match; the whole line is returned instead (probably because nothing matches, though it could also be that everything matches). Does anyone have an idea? I am not able to modify the input.

BR Patrik

I haven't used this particular extension for GNU awk but maybe you want to take a look: http://gawkextlib.sourceforge.net/xml/xml.html — James Brown, Jun 28 '19 at 11:36
A regex is never a robust or suitable tool for modifying xml/html data — RomanPerekhrest, Jun 28 '19 at 11:43
@WiktorStribiżew Thanks. Got past the first ". Tried `gensub(/.*vendor=[^"](.*)[^"].*/,"\\1","g",$0)`, but this only removed the first " (not touching the second " did not work and I assume you did not mean it would). I also tried to escape with space and [[:space:]] and ["$] and others for the second ". This did not work either. Any suggestion on this? — patrik, Jun 28 '19 at 11:55
I meant `gensub(/.*attr1="([^"]*)".*/,"\\1","g",$0)` though I doubt `g` makes sense here. — Wiktor Stribiżew, Jun 28 '19 at 11:56
@WiktorStribiżew :(. Looks as if this did not work very well. I still get the full line printed. And yes, "g" does probably not do so much. This was more the autopilot, as you only notice it when things go bad. — patrik, Jun 28 '19 at 12:02
Look, you [just need](https://ideone.com/DWtnTU) `awk '/attr1="/{ print gensub(/.*attr1="([^"]*)".*/,"\\1", 1) }'`. A [sed solution](https://ideone.com/YkVoRM). — Wiktor Stribiżew, Jun 28 '19 at 12:13
@WiktorStribiżew these solutions will only work if the xml line is a single line. — kvantour, Jun 28 '19 at 12:33
Some call it [summoning the daemon](https://www.metafilter.com/86689/), others refer to it as [the Call for Cthulhu](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/) and few just [turned mad and met the Pony](https://stackoverflow.com/a/1732454/8344060). In short, never parse XML or HTML with a regex! Did you try an XML parser such as `xmlstarlet`, `xmllint` or `xsltproc`? — kvantour, Jun 28 '19 at 12:34
@kvantour Exactly, just for the example string. Otherwise, there is no point using `awk` / `sed`, etc. — Wiktor Stribiżew, Jun 28 '19 at 12:34
[You can't parse \[X\]HTML with regex](http://stackoverflow.com/a/1732454/3776858). I suggest to use an XML/HTML parser (xmlstarlet, e.g.). — Cyrus, Jun 28 '19 at 14:21
@WiktorStribiżew Your solution worked just great. I realized later that the slanted ” symbols were not just a font effect; they are different characters from the plain ". — patrik, Jun 28 '19 at 17:10
@WiktorStribiżew Yeah probably, but there are some answers to it now. SO also seems to have some policy against removing posts. Even crappy ones. — patrik, Jul 01 '19 at 10:51
Probably there is a better option, please check those answers. Let me know if you want me to post my solution. — Wiktor Stribiżew, Jul 01 '19 at 10:51
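The comments above converge on two points: the value should be matched with `[^"]*` rather than `.*`, and the quotes in the sample are typographic ” characters, not ASCII ". A minimal sketch that handles both follows. It assumes one element per line, and `get_attr` is a hypothetical helper name, not something from the thread. The quotes are normalised first, and the code uses only POSIX awk features (`match`, `substr`), so it should not depend on gawk's `gensub`. The attribute name is pasted into the regex as-is, so a name containing regex metacharacters would need escaping.

```shell
# Sample line from the question, with typographic (curly) quotes.
line='<element attr1=”XX” attr2=”0818820\.x11” attr3=”YYXX.x11” attr-4=”1”/>'

# get_attr NAME: print the value of attribute NAME for each input line that has it.
get_attr() {
    awk -v name="$1" '
    {
        # Normalise typographic quotes to plain ASCII double quotes.
        gsub(/“|”/, "\"")
        # Look for  name="value"  preceded by whitespace, so that attr1
        # does not also match inside a longer name such as xattr1.
        if (match($0, "[ \t]" name "=\"[^\"]*\"")) {
            s = substr($0, RSTART, RLENGTH)
            sub(/^[^"]*"/, "", s)   # drop everything up to the opening quote
            sub(/"$/, "", s)        # drop the closing quote
            print s
        }
    }'
}

result=$(printf '%s\n' "$line" | get_attr attr1)
echo "$result"
```

As kvantour notes, this only holds while each element sits on a single line; for anything more general, a real XML parser is the safer choice.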

dr-who · Answer 1 · 2019-06-29T04:49:09.977

Assuming the input is in file.txt

$ cat file.txt
<element attr1=”XX” attr2=”0818820\.x11” attr3=”YYXX.x11” attr-4=”1”/>

then use grep to pull out the attributes and split each one on the =, as follows:

$ grep -Eo "attr[0-9]+[^ ]* " file.txt | awk -F= '{print $1"\t"$2}'
attr1   ”XX” 
attr2   ”0818820\.x11” 
attr3   ”YYXX.x11”

If you only want attr1, filter on attr1 as well:

$ grep -Eo "attr[0-9]+[^ ]* " file.txt | awk -F= '/attr1/{print $2}'
”XX”

You can tweak the grep line for other attributes. For example, to also pick up the last attribute, which is followed by /> rather than a space, replace the / with a space first; that keeps the pattern simple:

$ sed 's|/| |g' file.txt | grep -Eo "attr[^ ]* "
attr1=”XX” 
attr2=”0818820\.x11” 
attr3=”YYXX.x11” 
attr-4=”1”
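The same listing can be sketched in awk alone, without the sed step, by repeatedly matching the next name="value" pair and consuming it. This is a rough sketch under a few assumptions: `list_attrs` is a hypothetical helper name, the name pattern is a simplification of what XML allows, each element sits on one line, and the typographic quotes are converted to ASCII quotes before matching.

```shell
# Sample line from the question, with typographic (curly) quotes.
line='<element attr1=”XX” attr2=”0818820\.x11” attr3=”YYXX.x11” attr-4=”1”/>'

# list_attrs: print "name<TAB>value" for every name="value" pair on each line.
list_attrs() {
    awk '
    {
        gsub(/“|”/, "\"")                 # typographic quotes -> ASCII quotes
        rest = $0
        # Repeatedly match the next  name="value"  pair and consume it.
        while (match(rest, /[A-Za-z_][A-Za-z0-9_.:-]*="[^"]*"/)) {
            pair = substr(rest, RSTART, RLENGTH)
            rest = substr(rest, RSTART + RLENGTH)
            eq = index(pair, "=")
            name = substr(pair, 1, eq - 1)
            # Skip the = and the opening quote, stop before the closing quote.
            value = substr(pair, eq + 2, length(pair) - eq - 2)
            printf "%s\t%s\n", name, value
        }
    }'
}

attrs=$(printf '%s\n' "$line" | list_attrs)
printf '%s\n' "$attrs"
```

Because the name pattern allows a hyphen, attr-4 is picked up without treating the trailing /> specially.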

score 0 · Answer 2 · answered Jun 29 '19 at 21:52

No reason to reinvent the wheel. gawk-xml documentation mentions several xml parsers for awk, for example Jan Weber’s getXML script (floating around the internets, I found it here). Testing it produced:

$ awk -f getXML.awk test.xml
TAG element
        attr-4=”1”
        attr1=”XX”
        attr2=”0818820\.x11”
        attr3=”YYXX.x11”
END element
