14

Good evening, I have used BeautifulSoup to extract some data from a website as follows:

from BeautifulSoup import BeautifulSoup
from urllib2 import urlopen

soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))

table = soup.findAll('table', attrs={ "class" : "table-horizontal-line"})

print table

This gives the following output:

[<table width="70%" class="table-horizontal-line">
<tr>
<th>Amount</th>
<th>Company or person fined</th>
<th>Date</th>
<th>What was the fine for?</th>
<th>Compensation</th>
</tr>
<tr>
<td><a name="_Hlk74714257" id="_Hlk74714257">&#160;</a>£4,000,000</td>
<td><a href="/pages/library/communication/pr/2002/124.shtml">Credit Suisse First Boston International </a></td>
<td>19/12/02</td>
<td>Attempting to mislead the Japanese regulatory and tax authorities</td>
<td>&#160;</td>
</tr>
<tr>
<td>£750,000</td>
<td><a href="/pages/library/communication/pr/2002/123.shtml">Royal Bank of Scotland plc</a></td>
<td>17/12/02</td>
<td>Breaches of money laundering rules</td>
<td>&#160;</td>
</tr>
<tr>
<td>£1,000,000</td>
<td><a href="/pages/library/communication/pr/2002/118.shtml">Abbey Life Assurance Company ltd</a></td>
<td>04/12/02</td>
<td>Mortgage endowment mis-selling and other failings</td>
<td>Compensation estimated to be between £120 and £160 million</td>
</tr>
<tr>
<td>£1,350,000</td>
<td><a href="/pages/library/communication/pr/2002/087.shtml">Royal &#38; Sun Alliance Group</a></td>
<td>27/08/02</td>
<td>Pension review failings</td>
<td>Redress exceeding £32 million</td>
</tr>
<tr>
<td>£4,000</td>
<td><a href="/pubs/final/ft-inv-ins_7aug02.pdf" target="_blank">F T Investment &#38; Insurance Consultants</a></td>
<td>07/08/02</td>
<td>Pensions review failings</td>
<td>&#160;</td>
</tr>
<tr>
<td>£75,000</td>
<td><a href="/pubs/final/spe_18jun02.pdf" target="_blank">Seymour Pierce Ellis ltd</a></td>
<td>18/06/02</td>
<td>Breaches of FSA Principles ("skill, care and diligence" and "internal organization")</td>
<td>&#160;</td>
</tr>
<tr>
<td>£120,000</td>
<td><a href="/pages/library/communication/pr/2002/051.shtml">Ward Consultancy plc</a></td>
<td>14/05/02</td>
<td>Pension review failings</td>
<td>&#160;</td>
</tr>
<tr>
<td>£140,000</td>
<td><a href="/pages/library/communication/pr/2002/036.shtml">Shawlands Financial Services ltd</a> - formerly Frizzell Life &#38; Financial Planning ltd)</td>
<td>11/04/02</td>
<td>Record keeping and associated compliance breaches</td>
<td>&#160;</td>
</tr>
<tr>
<td>£5,000</td>
<td><a href="/pubs/final/woodwards_4apr02.pdf" target="_blank">Woodward's Independent Financial Advisers</a></td>
<td>04/04/02</td>
<td>Pensions review failings</td>
<td>&#160;</td>
</tr>
</table>]

I would like to export this into CSV whilst keeping the table structure as displayed on the website, is this possible and if so how?

Thanks in advance for the help.

merlin_1980
  • 207
  • 2
  • 3
  • 9
  • 1
    You might want to look at this solution - http://sebsauvage.net/python/html2csv.py . Found it by Googling "html to csv python" :) – Yam Mesicka Jan 05 '13 at 01:00
  • Thanks, although that solution seems quite complicated? I am hoping there is an easier way considering I have all the data in a relatively clean format ... if not, I will attempt to follow this :-) – merlin_1980 Jan 05 '13 at 01:06

1 Answers1

27

Here is a basic thing you can try. This makes the assumption that the headers are all in the <th> tags, and that all subsequent data is in the <td> tags. This works in the single case you provided, but I'm sure adjustments will be necessary if other cases :) The general idea is that once you find your table (here using find to pull the first one), we get the headers by iterating through all th elements, storing them in a list. Then, we create a rows list that will contain lists representing the contents of each row. This is populated by finding all td elements under tr tags and taking the text, encoding it in UTF-8 (from Unicode). You then open a CSV, writing the headers first and then writing all of the rows, but using(row for row in rows if row)` to eliminate any blank rows):

In [117]: import csv

In [118]: from bs4 import BeautifulSoup

In [119]: from urllib2 import urlopen

In [120]: soup = BeautifulSoup(urlopen('http://www.fsa.gov.uk/about/media/facts/fines/2002'))

In [121]: table = soup.find('table', attrs={ "class" : "table-horizontal-line"})

In [122]: headers = [header.text for header in table.find_all('th')]

In [123]: rows = []

In [124]: for row in table.find_all('tr'):
   .....:     rows.append([val.text.encode('utf8') for val in row.find_all('td')])
   .....: 

In [125]: with open('output_file.csv', 'wb') as f:
   .....:     writer = csv.writer(f)
   .....:     writer.writerow(headers)
   .....:     writer.writerows(row for row in rows if row)
   .....: 

In [126]: cat output_file.csv
Amount,Company or person fined,Date,What was the fine for?,Compensation
" £4,000,000",Credit Suisse First Boston International ,19/12/02,Attempting to mislead the Japanese regulatory and tax authorities, 
"£750,000",Royal Bank of Scotland plc,17/12/02,Breaches of money laundering rules, 
"£1,000,000",Abbey Life Assurance Company ltd,04/12/02,Mortgage endowment mis-selling and other failings,Compensation estimated to be between £120 and £160 million
"£1,350,000",Royal & Sun Alliance Group,27/08/02,Pension review failings,Redress exceeding £32 million
"£4,000",F T Investment & Insurance Consultants,07/08/02,Pensions review failings, 
"£75,000",Seymour Pierce Ellis ltd,18/06/02,"Breaches of FSA Principles (""skill, care and diligence"" and ""internal organization"")", 
"£120,000",Ward Consultancy plc,14/05/02,Pension review failings, 
"£140,000",Shawlands Financial Services ltd - formerly Frizzell Life & Financial Planning ltd),11/04/02,Record keeping and associated compliance breaches, 
"£5,000",Woodward's Independent Financial Advisers,04/04/02,Pensions review failings, 
RocketDonkey
  • 36,383
  • 7
  • 80
  • 84
  • Thanks, this looks like the perfect solution. However, I seem to be getting a SyntaxError with the 'cat output_file.csv' line, it just reads invalid syntax? – merlin_1980 Jan 05 '13 at 09:58
  • 1
    @merlin_1980 Oh I'm sorry, should have mentioned that is an IPython-specific thing (was basically just trying to show the contents of the file). If you get up to that point, you should have the file saved in that directory. – RocketDonkey Jan 05 '13 at 10:35
  • Thanks very much :-) I didn't think about looking in the directory and opening the file manually! – merlin_1980 Jan 05 '13 at 10:52
  • @merlin_1980 No problem - I could have been more clear with that :) Good luck with everything! – RocketDonkey Jan 05 '13 at 10:53
  • This helped me too: didn't realize that a text attribute was accessible on a BeautifulSoup Tag type – Brian Whitton Dec 21 '13 at 23:29
  • Great answer. Used it today for a similar issue. Thanks! – jxmorris12 Jul 06 '16 at 04:15
  • 1
    when its thead instead of th: `headers = [header.text for header in table.find('thead').find_all('td')]` – Miss.Saturn Jun 13 '17 at 07:39