Extract cell value from html table using bash

Question

I am trying to create a BASH/Perl script which would get a specific value from a dynamic html table.

Here is a sample of my page


<table border="1" bordercolor="#FFCC00" style="background-color:#FFFFCC" width="100%" cellpadding="3" cellspacing="3">

<tr align="center">

<th>Environment</th><th>Release Track</th><th>Artifact</th><th>Name</th><th>Build #</th><th>Cert Idn</th><th>Build Idn</th><th>Request Status</th><th>Update Time</th><th>Log Info.</th><th>Initiator</th>

</tr>

<tr>
<td>DEV03</td><td>2.1.0</td><td>abpa</td><td>ecom-abpa-ear</td><td>204</td><td>82113</td><td>171242</td><td>Deployed</td><td>3/18/2013 3:10:58 PM</td><td width="70">Log info</a></td><td>CESAR</td>
</tr>

<tr>
<td>DEV03</td><td>2.1.0</td><td>abpa</td><td>abpa_dynamic_config_properties</td><td>20</td><td>82113</td><td>167598</td><td>Deployed</td><td>3/18/2013 2:32:27 PM</td><td width="70">Log info</a></td><td>CESAR</td>

</tr>

</table>

My goal is to get this value from this cell.

"Deployed"

Another way to look at it...

Retrieve all data under the "Request Status" column

The value "Deployed" is dynamic and could change.

I have tried the following:

sed -e 's/>/>\n/g' abpa_cesar_status.txt | egrep -i "^\s*[A-Z]+</td>" | sed -e 's|</td>||g' | grep Deployed

But that only greps for "Deployed"

Any ideas?

You mentioned Perl. So use [HTML::TableExtract](http://p3rl.org/HTML::TableExtract). — choroba, Mar 19 '13 at 16:41
**Don't use regular expressions to parse HTML**. You cannot reliably parse HTML with regular expressions, and you will face sorrow and frustration down the road. As soon as the HTML changes from your expectations, your code will be broken. See http://htmlparsing.com/perl for examples of how to properly parse HTML with Perl modules that have already been written, tested and debugged. — Andy Lester, Mar 19 '13 at 16:43
*Really* **don't use regular expressions:** http://stackoverflow.com/a/1732454/140740 — DigitalRoss, Mar 19 '13 at 16:53
Your document output is ill-formed. Is that expected, or a typo? Here is a well-formed version: http://pastebin.com/R8RGX1T9 — Édouard Lopez, Mar 19 '13 at 17:26
I can't look at the well-formed version since my work blocks those links. — user2187297, Mar 19 '13 at 20:14
just tested this awk -F "*td>|*tr>" '/<\/*t[td]>.*[A-Z]/ {print $8, $16 }' abpa_cesar_status.txt | grep ecom-abpa-ear — user2187297, Mar 19 '13 at 20:17

score 3 · Answer 1 · answered Mar 19 '13 at 16:46

You should use a parser such as xmllint to do this.

With xmllint you can extract elements based on an xpath.

For example:

$ xmllint --html --format --shell file.html <<< "cat //table/tr/td[position()=8]/text()"
/ >  -------
Deployed
 -------
Deployed
/ >

The xpath //table/tr/td[position()=8]/text(), in the command above, returns the values from the 8th table column.
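The shell-mode output above still carries the "/ >" prompts and the " -------" separators. If only the cell values are wanted, a small filter can strip that decoration. This is a rough sketch based only on the output shown above; strip_xmllint_shell is a made-up helper name, and the filter would also drop a cell that is empty or contains nothing but dashes.

```shell
#!/usr/bin/env bash
# strip_xmllint_shell is a hypothetical helper, not part of xmllint.
# It reads `xmllint --shell` output on stdin and keeps only the values.
strip_xmllint_shell() {
  # 1st expression: drop a leading "/ > " prompt.
  # 2nd expression: delete lines that are empty or hold only spaces and dashes.
  sed -e 's|^/ > *||' -e '/^[ -]*$/d'
}

# Demo on the output shown in the answer; prints "Deployed" twice, one per line.
printf '%s\n' '/ >  -------' 'Deployed' ' -------' 'Deployed' '/ > ' | strip_xmllint_shell

# Assumed usage with the command from the answer:
#   xmllint --html --shell file.html <<< "cat //table/tr/td[position()=8]/text()" | strip_xmllint_shell
```

Piping through the filter should leave one value per line, which is easier to feed into grep or a while-read loop.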

score 3 · Answer 2 · answered Mar 19 '13 at 17:24

You can also use my Xidel to get everything in the 8th column:

xidel your_table.html -e '//table//tr/td[8]'

Or if the column position can also change, get the column-number first:

xidel your_table.html -e 'column:=count(//table//th[.="Request Status"]/preceding-sibling::*)+1' -e '//table//tr/td[$column]'
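The same header lookup can also be folded into a single XPath 1.0 expression, which may help with tools that have no variable assignment. The sketch below only builds the expression; column_xpath is a hypothetical helper name, and it assumes the header text contains no double quote.

```shell
#!/usr/bin/env bash
# column_xpath is a hypothetical helper: it prints an XPath 1.0 expression that
# selects the text of every <td> in the column whose <th> text equals "$1".
# Assumption: "$1" contains no double quote, since it is embedded in "...".
column_xpath() {
  # count(preceding-sibling::*) + 1 is the 1-based position of the header;
  # a numeric predicate on td then selects the cell at that position.
  printf '//table//tr/td[count(//table//th[.="%s"]/preceding-sibling::*)+1]/text()\n' "$1"
}

column_xpath 'Request Status'

# Assumed usage (needs an xmllint build that supports --xpath; the stray </a>
# in the sample may produce parser warnings on stderr):
#   xmllint --html --xpath "$(column_xpath 'Request Status')" your_table.html 2>/dev/null
```

How xmllint separates multiple matching text nodes seems to vary between versions, so the output may need splitting afterwards.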

score 2 · Answer 3 · answered Mar 19 '13 at 17:07

You can try xsh, a wrapper around XML::LibXML:

open :F html abpa_cesar_status.txt ;
$status = count(//table/tr[1]/th[.="Request Status"]/preceding-sibling::th) ;
ls //td[count(preceding-sibling::td)=$status] ;

In order to use it, you have to make your html a bit more well formed, though (I had to remove </a> to make the script work).

answered by choroba

Looks like a nice concept/tool. – Édouard Lopez, Mar 19 '13 at 17:11

score 2 · Accepted Answer · answered Mar 19 '13 at 17:32

Note that your document is ill-formed (the </a> closing tags have no matching opening <a>). Is that expected, or a typo? Otherwise, here is a well-formed version.

Command

I like xmlstarlet: simple, straightforward XPath for short tests:

xmlstarlet sel -t -m "//table/tr/td[position()=8]" -v "./text()" -n

Explanation

sel   (or select)        - Select data (mode) or query XML document(s) (XPATH, etc)
-t or --template         - start a template
-m or --match <xpath>    - match XPATH expression
-v or --value-of <xpath> - print value of XPATH expression
-n or --nl               - print new line

Output

Deployed
Deployed
# plus empty-cell

score 0 · Answer 5 · answered Mar 19 '13 at 16:46

Quick and dirty:

cat your_html_file | perl -pe "s/^<\/?table.*$//g;s/^<tr .*$//g;s/<tr> (<td>.*?){8}//g;s/<th.*$//g;s/<\/.*$//g" | sed '/^$/d'

However, this is not how you should do it. Use existing (Perl?) software to parse html and extract your value.

edit: Since you changed your code (added whitespaces), this doesn't work anymore. QED.
