python - How to transform an HTML file into a human-readable txt file?

Question

I have a lot of HTML files that look like this:

<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
<b>
 Sector:
</b>
N/A
<br />
<b>
 Industry:
</b>
N/A
<br />
</font>

What I want to do is take out the text in the middle of the file and transform it into a human-readable format. In this example, it is:

According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.

On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.

I know I have to do three things:

take out the text in the middle of the file
replace "<br />" with "\n"
replace "&nbsp;" with " " (one space)

I know the last two are easy using Python's replace method, but I don't know how to achieve the first.

I know a little about regular expressions and BeautifulSoup, but I don't know how to apply them to this problem.

Can someone help me?

Thanks, and I'm sorry for my poor English.

@Paul: I want just one section, the summary. My teacher (who doesn't know much about computers) gave me a lot of HTML files and asked me to transform them into a format suitable for data mining (my teacher is trying to use SAS for this). I don't know SAS, but I think it may be used to handle a lot of txt files, so I want to transform these HTML files into plain txt files.

@Owen: I need to handle a lot of HTML files, and I think this problem isn't too difficult, so I want to solve it directly with Python.

Have you tried using a text only web browser, such as lynx? Or do you want just one section, such as the summary? — Paul, Aug 23 '11 at 03:10
Do you have to do this in Python? I've found `pandoc` (http://johnmacfarlane.net/pandoc/) gives reasonably good output for this case. — Owen, Aug 23 '11 at 03:12
Possible duplicate? http://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python — Umang, Aug 23 '11 at 03:25

score 3 · Answer 1 · answered Aug 23 '11 at 06:45

You can use Scrapely.

Scrapely is a library for extracting structured data from HTML pages. Given some example web pages and the data to be extracted, scrapely constructs a parser for all similar pages.

http://github.com/scrapy/scrapely

pricco


score 2 · Accepted Answer · answered Aug 23 '11 at 03:24

To accomplish this task, you can use a Python library called lxml.

First, download and install lxml.

Now try running the following code:

from lxml.html import fromstring

html = '''
<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations.  On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed.  The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
<b>
 Sector:
</b>
N/A
<br />
<b>
 Industry:
</b>
N/A
<br />
</font>
'''

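# Parse the fragment and flatten it to plain text (tags are dropped, entities decoded).
htmlElement = fromstring(html)
textContent = htmlElement.text_content()

# Keep only the text between the "Summary:" label and the "INDUSTRY CLASSIFICATION:" heading.
result = textContent.split('\n\n Summary:\n\n')[1].split('\n\nINDUSTRY CLASSIFICATION:\n\n')[0]

print(result)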

This code works as long as '\n\n Summary:\n\n' comes before the desired text and '\n\nINDUSTRY CLASSIFICATION:\n\n' comes after it.
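Splitting on exact marker strings is sensitive to the whitespace around each label. As a rough sketch of a more tolerant variant, the following uses only the standard library's `html.parser` instead of lxml (that swap is an assumption, not part of this answer). The names `TextExtractor`, `html_to_text`, `extract_summary`, and `SAMPLE` are hypothetical, and `SAMPLE` is a shortened stand-in for the real files. It turns each `<br />` into a newline, replaces `&nbsp;` with a plain space, and keeps the text between the two labels.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes and turn every <br> into a newline."""

    def __init__(self):
        # convert_charrefs=True decodes entities, so &nbsp; arrives as "\xa0".
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br /> are routed here as well.
        if tag == "br":
            self.parts.append("\n")

    def handle_data(self, data):
        # Line breaks in the HTML source are not meaningful, so collapse
        # every run of whitespace (including non-breaking spaces) to one space.
        data = data.replace("\xa0", " ")
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_text(html):
    """Return the visible text of an HTML fragment, one line per <br>."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    text = re.sub(r" +", " ", text)        # squeeze repeated spaces
    text = re.sub(r" *\n *", "\n", text)   # trim spaces around line breaks
    return text.strip()


def extract_summary(html, start="Summary:", end="INDUSTRY CLASSIFICATION:"):
    """Return the text between the start and end labels (assumed markers)."""
    text = html_to_text(html)
    _, found, rest = text.partition(start)
    if not found:
        raise ValueError("start marker not found: %r" % start)
    body, _, _ = rest.partition(end)
    return body.strip()


# Shortened, hypothetical stand-in for one of the HTML files.
SAMPLE = """
<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
 Summary:
</b>
&nbsp;First paragraph.
<br />
<br />
Second  paragraph.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
 SIC Code:
</b>
0000
<br />
</font>
"""

print(extract_summary(SAMPLE))
```

Because the markers are matched after whitespace has been normalized, the function does not depend on how many newlines surround each label. It still assumes that every file contains both labels in this order.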

score 1 · Answer 3 · answered Aug 23 '11 at 04:12

The nearest option would be to convert the HTML to reStructuredText. You can try this with an online converter, which produces the following output:

 **Summary:** According to the complaint filed January 04, 2011, over a
six-week period in December 2007 and January 2008, six healthcare
related hedge funds managed by Defendant FrontPoint Partners LLC
(“FrontPoint”) sold more than six million shares of Human Genome
Sciences, Inc. (“HGSI”) common stock while their portfolio manager
possessed material negative non-public information concerning the HGSI’s
clinical trial for the drug Albumin Interferon Alfa 2-a.
 On March 2, 2011, the plaintiffs filed a First Amended Class Action
Complaint, amending the named defendants and securities violations. On
March 22, 2011, a motion for appointment as lead plaintiff and for
approval of selection of lead counsel was filed. The defendants
responded to the First Amended Complaint by filing a motion to dismiss
on March 28, 2011.

--------------

INDUSTRY CLASSIFICATION:
 **SIC Code:** 0000
 **Sector:** N/A
 **Industry:** N/A
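The online converter this answer refers to is not preserved here. One tool that can perform an HTML to reStructuredText conversion locally is pandoc, which is mentioned in the comments on the question; whether it reproduces the output above exactly is an assumption. A minimal sketch of calling it from Python follows. The helper names and the file names `case.html` and `case.rst` are hypothetical.

```python
import shutil
import subprocess


def pandoc_command(src, dst):
    """Build a pandoc invocation that converts an HTML file to reStructuredText."""
    return ["pandoc", "-f", "html", "-t", "rst", "-o", str(dst), str(src)]


def convert_with_pandoc(src, dst):
    """Run pandoc if it is installed; return False when it is not on PATH."""
    if shutil.which("pandoc") is None:
        return False
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(src, dst), check=True)
    return True


# Hypothetical file names, shown only to illustrate the generated command.
print(" ".join(pandoc_command("case.html", "case.rst")))
```

This converts the whole file, so the summary section would still need to be cut out of the result afterwards.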
