Extracting particular rows from an HTML file based on a column value using sed or awk

Question

I have an HTML file in which a table has a number of rows. A TR tag may have its corresponding /TR on another line. For example, a.html contains the following:

<TABLE BORDER=1><TR><TH>col1</TH><TH>col2</TH><TH>col3</TH><TH>col4</TH></TR><TR><TD>aaa</TD><TD>bbb</TD><TD>ccc</TD><TD>ddd</TD></TR><TR><TD>eee</TD><TD>fff</TD><TD>ccc</TD><TD>mmm</TD></TR><TR><TD>jjj</TD><TD>kkk</TD><TD>lll</TD><TD>ssss</TD></TR>.........</TABLE>

Now I need to extract the contents between the TR and /TR tags (inclusive) into another HTML file, based on the value of a TD found between them.

For example, from a.html I need to create b.html, which contains only the rows whose third column value is "ccc", while a.html itself remains unchanged.

<TR><TD>aaa</TD><TD>bbb</TD><TD>ccc</TD><TD>ddd</TD></TR><TR><TD>eee</TD><TD>fff</TD><TD>ccc</TD><TD>mmm</TD></TR>

I am a newbie and know only a little about sed and awk. Can anyone help me get this done, or suggest a better way to do it easily?

You can't do this reliably with regex; imagine the following: `.... ... ` Read [this](http://stackoverflow.com/a/1732454/632407). – clt60, Sep 11 '14 at 12:07

choroba · Answer 1 · 2014-09-11T13:11:16.697

2

Use a proper parser. For example, xsh, a wrapper around Perl's XML::LibXML, which in turn is a wrapper around Gnome libxml2 library:

open :F html file.html ;
ls //tr[td[3]='ccc'] ;

edited Sep 11 '14 at 13:11

answered Sep 11 '14 at 12:07

choroba


Yesterday, [you said it was a wrapper around Perl's XML::XSH2](http://stackoverflow.com/a/25763119/2088135) :) – Tom Fenech Sep 11 '14 at 13:06

score 1 · Answer 2 · edited May 23 '17 at 11:49

1

Use Python with BeautifulSoup to do this in a more structured and robust way (see "Python BeautifulSoup scrape tables"). Neither sed nor awk can actually parse HTML, so you may as well use something that can.

Here's a working program (pandas parses the HTML for you, using lxml or BeautifulSoup under the hood, and it lets me honor your wish to avoid "for" loops):

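import pandas

# read_html returns one DataFrame per <table>; take the first one
df = pandas.read_html('file.html')[0]
# select the third column by position, so this works whether or not
# the <TH> row was picked up as column labels
html = df[df.iloc[:, 2] == 'ccc'].to_html()
print(html)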


answered Sep 11 '14 at 11:48

John Zwinck


I am trying to avoid using a for loop in my Python script for performance reasons; moreover, I need to display this output file in the browser using another PHP script. So is this not possible with pattern matching in sed or awk? – Nishanth Sep 11 '14 at 11:56

Performance reasons? You must be joking? How big is your input? – John Zwinck Sep 11 '14 at 11:58
This HTML file will be generated from a MySQL table holding thousands of records, so I don't want to use a for loop. Are there any other alternatives? – Nishanth Sep 11 '14 at 12:01
What does it even mean "I don't want to use a for loop." Do you have a quantum computer handy? – John Zwinck Sep 11 '14 at 12:10
@Nishanth: I have now added a working Python script which "does not use for loops." I promise it will be fast. – John Zwinck Sep 11 '14 at 12:23
@Nishanth you are going to call an external `sed/awk/python/perl` process from your `php` and you are wondering about loop speed?! What about the speed of the fork/exec? – clt60 Sep 11 '14 at 12:31
Ahahaha. Thousands of records in a streaming operation is a performance problem? – chrylis -cautiouslyoptimistic- Sep 11 '14 at 12:33

@JohnZwinck I thought importing [`pandas`](http://en.wikipedia.org/wiki/Giant_panda) was illegal!! – jaypal singh Sep 11 '14 at 12:55

score 1 · Accepted Answer · answered Oct 13 '14 at 07:44

First I added a newline after each </TR>:

import os
os.system("sed 's/<\\/TR>/&\\\n/g' /tmp/file_full.html > /tmp/file_formated.html")
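
As a side note, the same row-splitting step can be done inside Python without spawning a shell. This is only a minimal sketch, assuming the whole file fits in memory; `split_rows` is a hypothetical helper name and is not part of the original answer:

```python
import re

def split_rows(html_text):
    """Append a newline after every closing </TR> tag (case-insensitive)."""
    return re.sub(r"(</tr>)", r"\1\n", html_text, flags=re.IGNORECASE)

sample = "<TR><TD>aaa</TD></TR><TR><TD>eee</TD></TR>"
print(split_rows(sample))
```

Reading /tmp/file_full.html, passing its contents through `split_rows`, and writing the result out would stand in for the sed call above.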

Then executing the following line gives the result. It checks each line for a cell whose value is "ccc" and, if one is found, writes that line to a separate file.

os.system(r'sed -n "/<TD>ccc<\/TD>/p" /tmp/file_formated.html > /tmp/file_ccc.html')
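
Note that the sed filter above matches `<TD>ccc</TD>` in any column, while the question asks for the third column only. The following is a hedged sketch of one way to test a specific column using only Python's standard-library `html.parser`; the names `RowFilter` and `filter_rows` are made up for illustration, nested tables are not handled, and it assumes the input is fed as a single string:

```python
from html.parser import HTMLParser

class RowFilter(HTMLParser):
    """Collect the raw <tr>...</tr> source of rows whose Nth <td> equals a value."""

    def __init__(self, source, column, wanted):
        super().__init__(convert_charrefs=True)
        self._source = source
        self._column = column        # 1-based position of the <td> to test
        self._wanted = wanted
        self.rows = []               # matching rows, copied verbatim
        self._row_start = None       # offset of the open <tr>, or None
        self._cells = []             # text of each <td> in the open row
        self._in_td = False
        # Offsets at which each line begins, to turn getpos() into an index.
        self._line_starts = [0]
        for i, ch in enumerate(source):
            if ch == "\n":
                self._line_starts.append(i + 1)

    def _abs_pos(self):
        line, col = self.getpos()    # position of the tag being handled
        return self._line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row_start, self._cells = self._abs_pos(), []
            self._in_td = False
        elif tag == "td" and self._row_start is not None:
            self._cells.append("")
            self._in_td = True

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._row_start is not None:
            # The end tag starts at the current position; include it whole.
            end = self._source.index(">", self._abs_pos()) + 1
            cells = self._cells
            if len(cells) >= self._column and cells[self._column - 1].strip() == self._wanted:
                self.rows.append(self._source[self._row_start:end])
            self._row_start = None


def filter_rows(source, column, wanted):
    """Return the rows of `source` whose `column`-th <td> text equals `wanted`."""
    parser = RowFilter(source, column, wanted)
    parser.feed(source)
    parser.close()
    return parser.rows


sample = (
    "<TABLE BORDER=1><TR><TH>col1</TH><TH>col2</TH><TH>col3</TH></TR>"
    "<TR><TD>aaa</TD><TD>bbb</TD><TD>ccc</TD></TR>"
    "<TR><TD>ccc</TD><TD>kkk</TD><TD>lll</TD></TR></TABLE>"
)
print("".join(filter_rows(sample, 3, "ccc")))
```

Because matching rows are sliced out of the original text, a row whose closing /TR sits on a later line is still copied intact, and a.html itself is never modified.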
