
I have an HTML file in which a table has a number of rows. A TR tag may have its corresponding /TR on another line. For example, a.html contains the following:

<TABLE BORDER=1><TR><TH>col1</TH><TH>col2</TH><TH>col3</TH><TH>col4</TH></TR><TR><TD>aaa</TD><TD>bbb</TD><TD>ccc</TD><TD>ddd</TD></TR><TR><TD>eee</TD><TD>fff</TD><TD>ccc</TD><TD>mmm</TD></TR><TR><TD>jjj</TD><TD>kkk</TD><TD>lll</TD><TD>ssss</TD></TR>.........</TABLE>

Now I need to extract everything between the TR and /TR tags (inclusive) into another HTML file, based on the value of a TD found between them.

For example, from a.html I need to create b.html, which contains only the rows whose third column value is "ccc", while a.html itself remains unchanged:

<TR><TD>aaa</TD><TD>bbb</TD><TD>ccc</TD><TD>ddd</TD></TR><TR><TD>eee</TD><TD>fff</TD><TD>ccc</TD><TD>mmm</TD></TR>

I am a newbie and have only a little familiarity with sed and awk. Can anyone help me get this done, or suggest a better way to do it easily?

Nishanth
  • You can't do this reliably with regex; consider the following: `.... ... ` Read [this](http://stackoverflow.com/a/1732454/632407). – clt60 Sep 11 '14 at 12:07

3 Answers

2

Use a proper parser. For example, xsh, a wrapper around Perl's XML::LibXML, which in turn is a wrapper around Gnome libxml2 library:

open :F html file.html ;
ls //tr[td[3]='ccc'] ;
choroba
  • Yesterday, [you said it was a wrapper around Perl's XML::XSH2](http://stackoverflow.com/a/25763119/2088135) :) – Tom Fenech Sep 11 '14 at 13:06
1

Use Python with BeautifulSoup to do this in a more structured and robust way (see "Python BeautifulSoup scrape tables"). Neither sed nor awk can actually parse HTML, so you may as well use something that can.

Here's a working program (Pandas uses BeautifulSoup inside, and it helps me fulfill your desire to not have "for" loops):

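import pandas

# read_html returns a list of DataFrames, one per <table> found in the file
df = pandas.read_html('file.html')[0]
# select by position (third column), since the <TH> row may become the column labels
html = df[df.iloc[:, 2] == 'ccc'].to_html()
print(html)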
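If installing pandas or BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's `html.parser`. This is only an illustration, not part of the answer above: the names `RowFilter`, `filter_rows` and `rows_to_html` are made up for this example, the sample string is the table from the question without the elided rows, and it assumes a simple table with no nested tables inside cells.

```python
from html import escape
from html.parser import HTMLParser


class RowFilter(HTMLParser):
    """Collect the rows whose <td> at a given position equals a value."""

    def __init__(self, column_index, wanted):
        super().__init__(convert_charrefs=True)
        self.column_index = column_index  # 0-based position of the cell to test
        self.wanted = wanted
        self.matches = []   # matching rows, each a list of cell texts
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> and <tr> look the same here.
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            cells = self._row
            if len(cells) > self.column_index and cells[self.column_index] == self.wanted:
                self.matches.append(cells)
            self._row, self._cell = None, None


def filter_rows(html_text, column_index=2, wanted="ccc"):
    """Return the cell texts of every row whose chosen column equals `wanted`."""
    parser = RowFilter(column_index, wanted)
    parser.feed(html_text)
    parser.close()
    return parser.matches


def rows_to_html(rows):
    """Rebuild <TR>...</TR> markup from cell texts, re-escaping the content."""
    return "".join(
        "<TR>" + "".join(f"<TD>{escape(cell)}</TD>" for cell in row) + "</TR>"
        for row in rows
    )


# Sample input: the table from the question, minus the elided rows.
sample = (
    "<TABLE BORDER=1><TR><TH>col1</TH><TH>col2</TH><TH>col3</TH><TH>col4</TH></TR>"
    "<TR><TD>aaa</TD><TD>bbb</TD><TD>ccc</TD><TD>ddd</TD></TR>"
    "<TR><TD>eee</TD><TD>fff</TD><TD>ccc</TD><TD>mmm</TD></TR>"
    "<TR><TD>jjj</TD><TD>kkk</TD><TD>lll</TD><TD>ssss</TD></TR></TABLE>"
)

print(rows_to_html(filter_rows(sample)))
```

The header row contains only TH cells, so it never matches. Note that the rows are rebuilt from the cell texts, so attributes on the original TR and TD tags would be lost; that is fine for the sample in the question but may not be for real data.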
Community
John Zwinck
  • I am trying to avoid using a for loop in my Python script for performance reasons; moreover, I need to display this output file in the browser using another PHP script. So is it not possible with pattern matching using sed or awk? – Nishanth Sep 11 '14 at 11:56
  • Performance reasons? You must be joking? How big is your input? – John Zwinck Sep 11 '14 at 11:58
  • This HTML file will be generated from a MySQL table holding thousands of records, so I don't want to use a for loop. Are there any other alternatives? – Nishanth Sep 11 '14 at 12:01
  • What does it even mean "I don't want to use a for loop." Do you have a quantum computer handy? – John Zwinck Sep 11 '14 at 12:10
  • @Nishanth: I have now added a working Python script which "does not use for loops." I promise it will be fast. – John Zwinck Sep 11 '14 at 12:23
  • @Nishanth You are going to call an external `sed/awk/python/perl` process from your `php` and you are wondering about loop speed?! What about the speed of the fork/exec? – clt60 Sep 11 '14 at 12:31
  • Ahahaha. Thousands of records in a streaming operation is a performance problem? – chrylis -cautiouslyoptimistic- Sep 11 '14 at 12:33
  • @JohnZwinck I thought importing [`pandas`](http://en.wikipedia.org/wiki/Giant_panda) was illegal!! – jaypal singh Sep 11 '14 at 12:55
1

First, I added a newline after each </TR>:

 os.system("sed 's/<\/TR>/&\\\n/g' /tmp/file_full.html > /tmp/file_formated.html")

Then, executing the following line gives the result. It checks each line for a cell with the value "ccc" and, if one is found, writes that line to a separate file:

os.system('sed -n "/<TD>ccc<\/TD>/p" /tmp/file_formated.html > /tmp/file_ccc.html')
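One caveat with the sed filter above is that it keeps a row when any cell is "ccc", not only the third one, and it relies on each row ending up on a single line. A rough pure-Python sketch of the same pattern-matching idea, restricted to a chosen column and without shelling out, could look like the following. The function names are invented for this example, the last sample row (with "ccc" in the first column) is made up to show the difference, UTF-8 is an assumed file encoding, and, as the comments on the question point out, regular expressions are not a real HTML parser, so this only suits simple, machine-generated tables like the one shown.

```python
import re

# One <TR>...</TR> block; DOTALL lets a row span several lines.
ROW_RE = re.compile(r"<TR\b[^>]*>.*?</TR>", re.IGNORECASE | re.DOTALL)
# The text of each <TD>...</TD> cell inside a row.
CELL_RE = re.compile(r"<TD\b[^>]*>(.*?)</TD>", re.IGNORECASE | re.DOTALL)


def rows_with_cell(html_text, column_index, wanted):
    """Return the raw <TR>...</TR> strings whose cell at column_index equals wanted."""
    selected = []
    for row in ROW_RE.findall(html_text):
        cells = CELL_RE.findall(row)
        if len(cells) > column_index and cells[column_index].strip() == wanted:
            selected.append(row)
    return selected


def write_filtered(src_path, dst_path, column_index=2, wanted="ccc"):
    """Read src_path, keep the matching rows, and write them to dst_path."""
    # UTF-8 is an assumption here; adjust to the real encoding of the file.
    with open(src_path, encoding="utf-8") as src:
        rows = rows_with_cell(src.read(), column_index, wanted)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(rows) + "\n")


# Sample based on the question; the last row is hypothetical and has "ccc"
# in the first column only, so it should not be selected.
sample = (
    "<TABLE BORDER=1><TR><TH>col1</TH><TH>col2</TH><TH>col3</TH><TH>col4</TH></TR>"
    "<TR><TD>aaa</TD><TD>bbb</TD><TD>ccc</TD><TD>ddd</TD></TR>"
    "<TR><TD>eee</TD><TD>fff</TD>\n<TD>ccc</TD><TD>mmm</TD>\n</TR>"
    "<TR><TD>ccc</TD><TD>kkk</TD><TD>lll</TD><TD>ssss</TD></TR></TABLE>"
)

for row in rows_with_cell(sample, 2, "ccc"):
    print(row)
```

With the paths used above, the call would be something like `write_filtered("/tmp/file_full.html", "/tmp/file_ccc.html")`; the intermediate reformatted file is not needed because the row pattern already tolerates line breaks.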
Nishanth