
I have some HTML like this:

<dl>
 <dt><a href="element1" id="element1">element1</a> Version 1 </dt>
 <dd>Description 1</dd>
 <dt><a href="element2" id="element2">element2</a> Version 2 </dt>
 <dd>Description 2</dd>
...
</dl>

And I would like to print output like this:

Item: element1, Version: Version1, Description: Description 1
Item: element2, Version: Version2, Description: Description 2
...

I tried several approaches, but my best attempt is:

xmllint --xpath "concat('Item: ', //dl/dt/a/text(),', Version: ',', Description: ',//dl/dd/text())" file

#output
Item: element1, Version: , Description: Description 1

Problems:

  • I cannot get the versions
  • I only get the first element, not all of them
JSerrahima

3 Answers


You can use htql. For your example:

text="""<dl>
 <dt><a href="element1" id="element1">element1</a> Version 1 </dt>
 <dd>Description 1</dd>
 <dt><a href="element2" id="element2">element2</a> Version 2 </dt>
 <dd>Description 2</dd>
...
</dl>"""

import htql
results = htql.query(text, "<dl>.<dt sep>2-0 {Item=<a>:tx; Version=<a>:xx; Description=<dd>:tx }")

Then show results:

>>> results
[('element1', ' Version 1 ', 'Description 1'), ('element2', ' Version 2 ', 'Description 2')]
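To turn those tuples into the lines the question asks for, something like the following should work. It is only a sketch: the `results` list is hard-coded from the output above so it runs without htql, `format_rows` is a made-up helper name, and stripping the whitespace around the version is an assumption about what the output should look like.

```python
# Result tuples copied from the htql output above (hard-coded so the
# sketch runs without htql installed).
results = [('element1', ' Version 1 ', 'Description 1'),
           ('element2', ' Version 2 ', 'Description 2')]


def format_rows(rows):
    """Hypothetical helper: one output line per (item, version, description)."""
    return [
        "Item: {}, Version: {}, Description: {}".format(
            item.strip(), version.strip(), description.strip())
        for item, version, description in rows
    ]


for line in format_rows(results):
    print(line)
```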
seagulf

If you don't have to stick with xmllint, here is a way to get the job done with a shell pipeline of standard tools (tr, grep, cut, awk, paste):

cat file | tr '>' '\n' | grep '.\+</' | cut -d '<' -f 1 | awk '{ if (NR%3==1) print "Item: "$0","; if (NR%3==2) print "Version: "$0","; if (NR%3==0) print "Description: "$0;}' | paste -sd '  \n' -

Explanation:

1st part of the pipe: extract the data of interest

cat file | tr '>' '\n' | grep '.\+</' | cut -d '<' -f 1

This outputs:

element1
Version 1
Description 1
element2
Version 2
Description 2

2nd part of the pipe: prefix each line with a label based on its line number

awk '{ if (NR%3==1) print "Item: "$0","; if (NR%3==2) print "Version: "$0","; if (NR%3==0) print "Description: "$0;}'

This outputs:

Item: element1,
Version:  Version 1 ,
Description: Description 1
Item: element2,
Version:  Version 2 ,
Description: Description 2

Final part of the pipe: join every three lines into one

paste -sd '  \n' -

This outputs the final result you want.
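
The pipeline relies on every entry producing exactly three text lines, so it can misalign if the markup is less regular than the sample. As an alternative, here is a minimal sketch using Python's standard `html.parser` module. The class name `DlParser` and the variable `html` are made up for illustration, and it assumes each `<dt>` holds one `<a>` followed by the version text, as in the question.

```python
from html.parser import HTMLParser


class DlParser(HTMLParser):
    """Collect (item, version, description) from <dt><a>item</a> version</dt><dd>desc</dd>."""

    def __init__(self):
        super().__init__()
        self.rows = []    # finished (item, version, description) tuples
        self.stack = []   # currently open tags among dt / a / dd
        self.item = self.version = self.description = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "a", "dd"):
            self.stack.append(tag)
        if tag == "dt":
            # A new <dt> starts a new entry.
            self.item = self.version = self.description = ""

    def handle_endtag(self, tag):
        if tag in ("dt", "a", "dd") and self.stack and self.stack[-1] == tag:
            self.stack.pop()
        if tag == "dd":
            # The closing </dd> completes the entry.
            self.rows.append((self.item.strip(),
                              self.version.strip(),
                              self.description.strip()))

    def handle_data(self, data):
        if not self.stack:
            return  # text outside dt/dd (whitespace, "...") is ignored
        current = self.stack[-1]
        if current == "a" and "dt" in self.stack:
            self.item += data
        elif current == "dt":
            self.version += data
        elif current == "dd":
            self.description += data


# Sample input taken from the question.
html = """<dl>
 <dt><a href="element1" id="element1">element1</a> Version 1 </dt>
 <dd>Description 1</dd>
 <dt><a href="element2" id="element2">element2</a> Version 2 </dt>
 <dd>Description 2</dd>
</dl>"""

parser = DlParser()
parser.feed(html)
parser.close()
for item, version, description in parser.rows:
    print(f"Item: {item}, Version: {version}, Description: {description}")
```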

Taylor G.
    [Please don't parse XML/HTML with regex or similar.](https://stackoverflow.com/a/1732454/3776858) – Cyrus Apr 17 '21 at 19:37

Following @seagulf's recommendation, it's easier with Python:

import htql
results = htql.query(mystr, "<dl>.<dt sep>2-0 {Item=<a>:tx; Version=<a>:xx; Description=<dd>:tx } \n")  # mystr holds the HTML
with open("output.txt", "w") as f:  # file name is a placeholder
    for x in results:
        f.write('{"item": "' + x[0] + '", "version" : "' + x[1] + '", "description" : "' + x[2] + '"},\n')

#output
{"item": "element 1", "version" : "version 1", "description" : "description 1"},
{"item": "element 2", "version" : "version 2", "description" : "description 2"},
...
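
One caveat: building the JSON by string concatenation breaks as soon as a value contains a double quote or a backslash. A sketch of the same loop using the standard `json` module is below. The `results` tuples are hard-coded from @seagulf's answer so it runs without htql, `to_json_lines` is just an illustrative name, and stripping the whitespace is my own choice.

```python
import json

# Tuples as returned by the htql query in @seagulf's answer.
results = [('element1', ' Version 1 ', 'Description 1'),
           ('element2', ' Version 2 ', 'Description 2')]


def to_json_lines(rows):
    """Illustrative helper: serialize each row as one JSON object per line."""
    return [
        json.dumps({
            "item": item.strip(),
            "version": version.strip(),
            "description": description.strip(),
        })
        for item, version, description in rows
    ]


for line in to_json_lines(results):
    print(line)
```

Each printed line is a standalone JSON object, so quotes inside a description are escaped automatically.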

Thank you so much!

JSerrahima