I have a lot of HTML files that look like this:
<font face="Garmond,Helvetica,Times" size="2" color="#330066">
<b>
Summary:
</b>
According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
<br />
<br />
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
<br />
<hr width="50%" align="left" />
INDUSTRY CLASSIFICATION:
<br />
<b>
SIC Code:
</b>
0000
<br />
<b>
Sector:
</b>
N/A
<br />
<b>
Industry:
</b>
N/A
<br />
</font>
What I want to do is take out the text in the middle of the file and transform it into a human-readable format. In this example, it is:
According to the complaint filed January 04, 2011, over a six-week period in December 2007 and January 2008, six healthcare related hedge funds managed by Defendant FrontPoint Partners LLC ("FrontPoint") sold more than six million shares of Human Genome Sciences, Inc. ("HGSI") common stock while their portfolio manager possessed material negative non-public information concerning the HGSI's clinical trial for the drug Albumin Interferon Alfa 2-a.
On March 2, 2011, the plaintiffs filed a First Amended Class Action Complaint, amending the named defendants and securities violations. On March 22, 2011, a motion for appointment as lead plaintiff and for approval of selection of lead counsel was filed. The defendants responded to the First Amended Complaint by filing a motion to dismiss on March 28, 2011.
I know I have to do three things:
- take out the text in the middle of the file
- replace "<br />" with "\n"
- replace " " with " " (one space)
I know the latter two are easy using the replace method in Python, but I don't know how to achieve the first goal.
I know a little about regular expressions and BeautifulSoup, but I don't know how to apply them to this problem.
Can someone help me?
Thanks, and I'm sorry for my poor English.
@Paul: I want just one section, the summary. My teacher (who doesn't know much about computers) gives me a lot of HTML files and asks me to transform them into a format suitable for data mining (my teacher is trying to use SAS for this). I don't know SAS, but I think it may be used to handle a lot of txt files, so I want to transform these HTML files into plain txt files.
@Owen: I need to handle a lot of HTML files, and I don't think this problem is too difficult, so I want to solve it directly with Python.
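For reference, here is a minimal sketch of the three steps using only regular expressions. It assumes every file follows the layout shown above: a bold "Summary:" label, then the text, then an `<hr>` tag. The names `extract_summary` and `convert_folder` are hypothetical, and reading the files as UTF-8 is an assumption as well.

```python
import re
from pathlib import Path


def extract_summary(html: str) -> str:
    """Return the text between the 'Summary:' label and the first <hr> tag."""
    # Step 1: capture everything after the </b> that closes the "Summary:"
    # label, up to (but not including) the first <hr ...> tag.
    match = re.search(r"Summary:\s*</b>(.*?)<hr\b", html,
                      re.DOTALL | re.IGNORECASE)
    if match is None:
        return ""  # assumed behaviour: no summary section means empty output
    text = match.group(1)
    # Step 2: turn each <br />, <br/> or <br> into a newline.
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    # Step 3: collapse runs of spaces or tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Trim each line and drop the empty ones, so each paragraph is one line.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


def convert_folder(src_dir: str, dst_dir: str) -> int:
    """Write one .txt file per .html file in src_dir; return the file count."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(src_dir).glob("*.html")):
        # Assumption: the files are UTF-8; undecodable bytes are replaced.
        html = path.read_text(encoding="utf-8", errors="replace")
        (out / (path.stem + ".txt")).write_text(extract_summary(html),
                                                encoding="utf-8")
        count += 1
    return count
```

Calling `convert_folder("html_files", "txt_files")` (both folder names are placeholders) would then produce one txt file per HTML file. If the real files vary more than this sample does, a parser such as BeautifulSoup is likely to be more robust than a regular expression.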