
I have this example XML line, which is read from a file:

<element attr1=”XX” attr2=”0818820\.x11” attr3=”YYXX.x11” attr-4=”1”/>

Because it is XML, the attributes can appear in any order and some may be optional.

With awk, I tried to extract one of them, say attr1, using gensub:

BEGIN {
    # Compare against 0 so a read error (-1) does not loop forever
    while ((getline < "./file") > 0) {
        print $0
        # First attempt
        # print gensub(/.*attr1=\"(.*)\".*/, "\\1", "g", $0)
        # Second attempt
        print gensub(/.*attr1="(.*)".*/, "\\1", "g", $0)
    }
}

However, I have not managed to get a match: the whole line is returned. That probably means the pattern did not match at all, although it could also mean the group captured everything. Does anyone have an idea? I am not able to modify the input.

BR Patrik

Cyrus
patrik
  • You need to replace `(.*)` with `([^"]*)` – Wiktor Stribiżew Jun 28 '19 at 11:35
  • I haven't used this particular extension for GNU awk but maybe you want to take a look: http://gawkextlib.sourceforge.net/xml/xml.html – James Brown Jun 28 '19 at 11:36
  • Regex is never a robust or suitable tool for modifying XML/HTML data – RomanPerekhrest Jun 28 '19 at 11:43
  • @WiktorStribiżew Thanks. Got past the first ". Tried `gensub(/.*vendor=[^"](.*)[^"].*/,"\\1","g",$0)`, but this only removed the first " (not touching the second " did not work and I assume you did not mean it would). I also tried to escape with space and [[:space:]] and ["$] and others for the second ". This did not work either. Any suggestion on this? – patrik Jun 28 '19 at 11:55
  • I meant `gensub(/.*attr1="([^"]*)".*/,"\\1","g",$0)` though I doubt `g` makes sense here. – Wiktor Stribiżew Jun 28 '19 at 11:56
  • @WiktorStribiżew :(. Looks as if this did not work very well. I still get the full line printed. And yes, "g" does probably not do so much. This was more the autopilot, as you only notice it when things go bad. – patrik Jun 28 '19 at 12:02
  • Look, you [just need](https://ideone.com/DWtnTU) `awk '/attr1="/{ print gensub(/.*attr1="([^"]*)".*/,"\\1", 1) }'`. A [sed solution](https://ideone.com/YkVoRM). – Wiktor Stribiżew Jun 28 '19 at 12:13
  • @WiktorStribiżew these solutions will only work if the xml line is a single line. – kvantour Jun 28 '19 at 12:33
  • Some call it [summoning the daemon](https://www.metafilter.com/86689/), others refer to it as [the Call for Cthulhu](https://blog.codinghorror.com/parsing-html-the-cthulhu-way/) and few just [turned mad and met the Pony](https://stackoverflow.com/a/1732454/8344060). In short, never parse XML or HTML with a regex! Did you try an XML parser such as `xmlstarlet`, `xmllint` or `xsltproc`? – kvantour Jun 28 '19 at 12:34
  • @kvantour Exactly, just for the example string. Otherwise, there is no point using `awk` / `sed`, etc. – Wiktor Stribiżew Jun 28 '19 at 12:34
  • [You can't parse \[X\]HTML with regex](http://stackoverflow.com/a/1732454/3776858). I suggest using an XML/HTML parser (xmlstarlet, e.g.). – Cyrus Jun 28 '19 at 14:21
  • @WiktorStribiżew Your solution worked just great. I realized later that the slanted ” symbols were not just a font effect: they are typographic quotes, not the ASCII " character. – patrik Jun 28 '19 at 17:10
  • I think you want to remove the question, right? – Wiktor Stribiżew Jun 28 '19 at 19:14
  • @WiktorStribiżew Yeah probably, but there are some answers to it now. SO also seems to have some policy against removing posts. Even crappy ones. – patrik Jul 01 '19 at 10:51
  • Probably there is a better option, please check those answers. Let me know if you want me to post my solution. – Wiktor Stribiżew Jul 01 '19 at 10:51
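The comment thread comes down to two points: capture with `[^"]*` instead of a greedy `.*`, and note that the sample line uses typographic quotes (” rather than ASCII "), so a pattern written with " never matches it. A minimal sketch of both together, assuming UTF-8 input and one element per line; `extract_attr1` is a hypothetical helper name, not something from the thread:

```shell
#!/bin/sh
# Hypothetical helper: print the value of attr1 for each line read from stdin.
# Assumptions: UTF-8 input, one element per line (see kvantour's caveat).
# Step 1: turn typographic quotes into plain ASCII double quotes.
# Step 2: print only lines where attr1="..." matched, keeping the captured value.
extract_attr1() {
    sed -e 's/“/"/g' -e 's/”/"/g' |
        sed -n 's/.*attr1="\([^"]*\)".*/\1/p'
}

# The sample line from the question, curly quotes included.
printf '%s\n' '<element attr1=”XX” attr2=”0818820\.x11” attr3=”YYXX.x11” attr-4=”1”/>' |
    extract_attr1
```

Because of `-n` and the `p` flag, lines without attr1 print nothing instead of being echoed back unchanged, which avoids the "whole line is returned" confusion. This is still a regex over a single line, not an XML parser.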

2 Answers


Assuming the input is in file.txt:

$ cat file.txt
<element attr1=”XX” attr2=”0818820\.x11” attr3=”YYXX.x11” attr-4=”1”/>

Use grep to pull out the attributes, then split each one on the = sign:

$ grep -Eo "attr[0-9]+[^ ]* " file.txt | awk -F= '{print $1"\t"$2}'
attr1   ”XX” 
attr2   ”0818820\.x11” 
attr3   ”YYXX.x11” 

If you only want attr1, filter for it in awk as well:

$ grep -Eo "attr[0-9]+[^ ]* " file.txt | awk -F= '/attr1/{print $2}'
”XX” 

You can tweak the grep pattern for other attributes. For example, the last attribute is followed by /> rather than a space, so the patterns above miss it; replacing each / with a space first makes the logic simpler:

$ sed < file.txt 's|/| |g' | egrep -o "attr[^ ]* "
attr1=”XX” 
attr2=”0818820\.x11” 
attr3=”YYXX.x11” 
attr-4=”1” 
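The pipelines above rely on a space following each attribute. A sketch of an alternative that matches each name="value" pair directly, so the last attribute and hyphenated names such as attr-4 need no special handling. It assumes UTF-8 input with one element per line, and `list_attrs` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: list every attribute found in file $1 as "name<TAB>value".
# Assumptions: UTF-8 input, one element per line, values contain no quote characters.
# Step 1 (sed):  normalise typographic quotes to ASCII double quotes.
# Step 2 (grep): emit each name="value" pair on its own line.
# Step 3 (awk):  split on the first "=" only and strip the surrounding quotes.
list_attrs() {
    sed -e 's/“/"/g' -e 's/”/"/g' "$1" |
        grep -oE '[A-Za-z_][A-Za-z0-9_.-]*="[^"]*"' |
        awk '{
            i = index($0, "=")
            print substr($0, 1, i - 1) "\t" substr($0, i + 2, length($0) - i - 2)
        }'
}

# Demo on the sample line from the question.
tmp=$(mktemp)
printf '%s\n' '<element attr1=”XX” attr2=”0818820\.x11” attr3=”YYXX.x11” attr-4=”1”/>' > "$tmp"
list_attrs "$tmp"
rm -f "$tmp"
```

For the sample line this should list four pairs, attr1 through attr-4, in the order they appear. As with the other regex approaches here, it is only meant for simple single-line input.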
dr-who

There is no reason to reinvent the wheel. The gawk-xml documentation mentions several XML parsers for awk, for example Jan Weber’s getXML script (it floats around the internet; I found it here). Testing it produced:

$ awk -f getXML.awk test.xml
TAG element
        attr-4=”1”
        attr1=”XX”
        attr2=”0818820\.x11”
        attr3=”YYXX.x11”
END element
James Brown