I'm trying to extract a Wikipedia list from https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes using BeautifulSoup.
This is my code:
import urllib.request
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, "html.parser")  # Name the parser explicitly
table = soup.find('table', class_="wikitable sortable")  # The class of the list on Wikipedia
Data = [[] for _ in range(9)]  # One list per column; I intend to turn this into a DataFrame
for row in table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 9:  # The first and last rows don't contain <td> tags
        for i in range(9):
            Data[i].append(cells[i].find(text=True))
This works quite well apart from a single value in the names column: the hurricane "New England". This is the HTML that contains that element:
<td><span data-sort-value="New England !"> <a href="/wiki/1938_New_England_hurricane" title="1938 New England hurricane">"New England"</a></span></td>
The name extracted for that hurricane is ' '. I think the space between <span> and <a> is causing this problem.
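The behaviour can be reproduced on that one cell alone, without fetching the page. This is a minimal sketch, assuming bs4 is installed and using the built-in "html.parser"; the variable names are only for illustration, and `string=True` is the newer spelling of the `text=True` argument used in my code. It compares the first text node of the cell with the text of the whole cell:

```python
from bs4 import BeautifulSoup

# The problematic cell, copied from the Wikipedia table
html = ('<td><span data-sort-value="New England !"> '
        '<a href="/wiki/1938_New_England_hurricane" '
        'title="1938 New England hurricane">"New England"</a></span></td>')

cell = BeautifulSoup(html, "html.parser").td

# find(string=True) returns only the first text node in document order,
# which here is the single space between <span> and <a>
first_text = cell.find(string=True)

# get_text() joins every text node in the cell; strip=True trims each
# piece and drops the ones that are empty after trimming
full_text = cell.get_text(strip=True)

print(repr(first_text))
print(repr(full_text))
```

On this snippet the first text node is just the space, while the whole-cell text does contain the name (including its quotation marks), which seems consistent with my guess about the cause.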
Is there a way to fix this in .find? Is there a smarter way to access lists in Wikipedia?
How can I avoid this in the future?