How to use xmlstarlet to extract HTML data by id

Question

I have a simple task that has me pulling my hair out; I'm sure I'm very close.

Here is my XHTML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>
<title>Test Page</title>
</head>

<body>

<p>
test
</p>

<table id="test_table">
<tr><td>test</td><td>test</td></tr>
<tr><th>mo test</th></tr>
</table>

</body>

</html>

... and xmlstarlet likes it:

$ xmlstarlet.exe el -v test.xhtml
html[@xmlns='http://www.w3.org/1999/xhtml']
html/head
html/head/title
html/body
html/body/p
html/body/table[@id='test_table']
html/body/table/tr
html/body/table/tr/td
html/body/table/tr/td
html/body/table/tr
html/body/table/tr/th

What I need to do is extract the data in the table tag, preferably without the HTML. The context is that I am writing a test set in which a web page is fetched and then written to a file. The test has to validate the table data but still succeed if other things on the page change. Also, I will not know in advance how many columns or rows the table will have; that varies with the data.

But when I try:

$ xmlstarlet.exe sel -t -c "/html/body/table[@id='test_table']" test.xhtml
Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
None of the XPaths matched; to match a node in the default namespace
use '_' as the prefix (see section 5.1 in the manual).
For instance, use /_:node instead of /node

Different tests need different ids, but all id values are unique. So, given any id in the XHTML, I need its data.

Thanks in advance.

Birei · Accepted Answer · 2014-02-25T17:42:00.920

12

The HTML document declares a default namespace, which you have to bind to a prefix in the xmlstarlet command:

xmlstarlet sel \
    -N n="http://www.w3.org/1999/xhtml" \
    -t \
    -c "/n:html/n:body/n:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null

Once the <table> element is located, I use descendant::*/text() to extract all of its text nodes, and 2>/dev/null to suppress the warning:

Attempt to load network entity http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

It yields:

testtestmo test

UPDATE: I didn't know this, but as the error message says, there is no need to declare the namespace yourself when it is the default one: xmlstarlet accepts _ as the prefix for it, so this also works:

xmlstarlet sel \
    -t \
    -c "/_:html/_:body/_:table[@id='test_table']/descendant::*/text()" \
htmlfile 2>/dev/null
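
The commands above run all the cell text together (testtestmo test), which may be awkward when the number of rows and columns is not known in advance. As a rough sketch of the same namespace-qualified lookup by id that keeps rows and cells separate, here is a version using Python's standard xml.etree.ElementTree instead of xmlstarlet. The choice of Python, the function name table_rows, the prefix x, and the inlined sample document are all assumptions made for illustration; they are not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; "x" is an arbitrary prefix chosen for this sketch.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# The question's document, inlined without its DOCTYPE to keep the sketch short.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Test Page</title></head>
<body>
<p>test</p>
<table id="test_table">
<tr><td>test</td><td>test</td></tr>
<tr><th>mo test</th></tr>
</table>
</body>
</html>"""


def table_rows(xhtml_text, element_id):
    """Return the cell text of the table with the given id, one list per row."""
    root = ET.fromstring(xhtml_text)
    # Element names need the namespace prefix; the id attribute does not.
    table = root.find(f".//x:table[@id='{element_id}']", XHTML_NS)
    if table is None:
        raise LookupError(f"no table with id {element_id!r}")
    rows = []
    for tr in table.iterfind(".//x:tr", XHTML_NS):
        # Each child of a row is a td or th; collect its full text content.
        rows.append(["".join(cell.itertext()).strip() for cell in tr])
    return rows


if __name__ == "__main__":
    print(table_rows(SAMPLE, "test_table"))
```

A test could then compare the returned list of rows against the expected data while ignoring everything else on the page.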

edited Feb 25 '14 at 17:42

answered Feb 25 '14 at 17:36

Birei



Thank you thank you thank you!!! Now I understand the error message as well! But I also would never have guessed the descendant syntax. – matt stucky Feb 25 '14 at 17:56

For _years_ I'd been running HTML through `tidy -q -asxml`, which produces namespaced (X)HTML, and wondering why `xmlstarlet sel` wouldn't produce any results, given a perfectly reasonable XPath expression like `//title` that worked fine with other tools. You have opened my mind now to the existence and possibility of namespaces (!), and the tip about using `_` for the default one is solid gold. ¡Muchas gracias! – TheDudeAbides Feb 06 '20 at 02:16

score 1 · Answer 2 · answered Jan 05 '18 at 17:11

As mentioned under "common problems" in http://xmlstar.sourceforge.net/doc/UG/ch05.html, when you use the -N x="http://www.w3.org/1999/xhtml" option you also have to prefix the node selections with x:, e.g.

 xmlstarlet sel \
  -N x="http://www.w3.org/1999/xhtml" \
  -t \
  -m "//x:pre" \
  -v . somehtml.html

This selects all pre elements and prints the text value of each.
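
To see why the prefix matters, here is a small sketch using Python's standard xml.etree.ElementTree rather than xmlstarlet. The sample document and variable names are made up for illustration. It runs the same query with and without a prefix against a document in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical document in the XHTML default namespace.
DOC = """<html xmlns="http://www.w3.org/1999/xhtml">
<body><pre>first</pre><p>skip</p><pre>second</pre></body>
</html>"""

root = ET.fromstring(DOC)
ns = {"x": "http://www.w3.org/1999/xhtml"}

# Without a prefix the query looks for pre in no namespace and finds nothing.
unprefixed = root.findall(".//pre")

# With the prefix bound to the XHTML namespace, both pre elements match.
prefixed = root.findall(".//x:pre", ns)
pre_texts = [el.text for el in prefixed]

print(len(unprefixed), pre_texts)
```

The unprefixed query returns an empty list, which corresponds to the "None of the XPaths matched" message in the question.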

score -1 · Answer 3 · edited Jan 31 '18 at 05:11


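You can try the following, which updates a cell's value in place rather than extracting it; for a document in the XHTML namespace, the element names need a prefix here as well:

xmlstarlet ed --inplace -N x="http://www.w3.org/1999/xhtml" -u "/x:html/x:body/x:table[@id='your_table_id']/x:tr[@id='row_id']/x:td[@id='data_id']" -v NEW_VALUE_TO_BE_CHANGED HTMLFILE_NAME 2>/dev/null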

edited Jan 31 '18 at 05:11

Sunil


answered Jan 31 '18 at 04:16

Prasad Tamgale

