
I'm trying to extract a Wikipedia list from https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes using BeautifulSoup.

This is my code:

import urllib.request
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page)
table = soup.find('table', class_="wikitable sortable")  # The class of the list in Wikipedia

Data = [[] for _ in range(9)] # I intend to turn this into a DataFrame
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells)==9: # The start and end don't include a <td> tag
        for i in range(9):
            Data[i].append(cells[i].find(text=True))

This works quite well apart from a single value in the names column: the hurricane "New England". This is the HTML that contains that element:

<td><span data-sort-value="New England !"> <a href="/wiki/1938_New_England_hurricane" title="1938 New England hurricane">"New England"</a></span></td>

The extracted name for that hurricane is ' '. I think the space between the <span> and the <a> is causing this problem. Is there a way to fix this in .find? Is there a smarter way to access lists on Wikipedia, and how can I avoid this in the future?

2 Answers


The simplest way to read a table into a DataFrame is pandas' read_html():

import pandas as pd

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
pd.read_html(wiki)[1]  # index 1 selects the second table on the page

Output:

    Name    Dates as aCategory 5    Duration as aCategory 5 Sustainedwind speeds    Pressure    Areas affected  Deaths  Damage(USD) Refs
0   "Cuba"  October 19, 1924    12 hours    165 mph (270 km/h)  910 hPa (26.87 inHg)    Central America, Mexico, CubaFlorida, The Bahamas   90  NaN [12]
1   "San Felipe IIOkeechobee"   September 13–14, 1928   12 hours    160 mph (260 km/h)  929 hPa (27.43 inHg)    Lesser Antilles, The BahamasUnited States East...   4000    NaN NaN

...
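To see why read_html() sidesteps the whitespace issue, here is a minimal offline sketch using an inline HTML snippet modeled on the problematic cell (the table content here is invented for illustration, not the real page):

```python
import io
import pandas as pd

# A tiny stand-in for the Wikipedia table, including the nested
# <span>/<a> markup that tripped up find(text=True) in the question.
html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Deaths</th></tr>
  <tr>
    <td><span data-sort-value="New England !"> <a href="#">"New England"</a></span></td>
    <td>682</td>
  </tr>
</table>
"""

df = pd.read_html(io.StringIO(html))[0]
# read_html flattens the nested markup into plain text and trims
# whitespace, so the name comes through intact.
```

Note that read_html() keeps the literal quote characters from the cell text, which is why the names in the output above appear as `"Cuba"` etc.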

To improve your example, you can do the following:

import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = requests.get(wiki).content
soup = BeautifulSoup(page, 'lxml')
table = soup.find('table', class_="wikitable sortable")  # The class of the list in Wikipedia

data = []
for row in table.select('tr')[1:-1]:  # skip the header row and the footer row
    cells = []
    for cell in row.select('td'):
        cells.append(cell.get_text('', strip=True))
    data.append(cells)

get_text('', strip=True) gathers the text from all nodes inside the td (including nested tags), strips the whitespace from each piece, and joins them, so the stray space disappears.
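To see the difference concretely, here is a small self-contained demo using the exact cell markup from the question (html.parser is used so no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

# The problematic cell from the question.
html = ('<td><span data-sort-value="New England !"> '
        '<a href="/wiki/1938_New_England_hurricane">"New England"</a></span></td>')
cell = BeautifulSoup(html, 'html.parser').td

# find(text=True) returns only the FIRST text node in document order -
# here that is the lone space between <span> and <a>.
first_text = cell.find(text=True)      # ' '

# get_text('', strip=True) joins ALL text nodes, dropping any that
# strip to empty, so the space vanishes and the name survives.
name = cell.get_text('', strip=True)   # '"New England"'
```

This is exactly why the question's loop produced ' ' for that one row: every other name cell happens to start with the link text, but this one starts with a whitespace-only node.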

HedgeHog
  • Thanks :). I couldn't find any documentation on the issue; where could I go in the future if a BS4 problem occurs? – ARunningFridge Sep 01 '21 at 15:02
  • Also, an earlier comment (now seemingly deleted) suggested pandas, and it did work with a lot less hassle, but the damage values are all NaNs, whereas with BS4 this doesn't happen. Is there a quick fix? – ARunningFridge Sep 01 '21 at 15:03
  • You can use the [docs here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) – HedgeHog Sep 01 '21 at 15:05
  • I tried using them but didn't find any information on the options for .find. You did not use it at all; it seems there are some redundancies in BS4 – ARunningFridge Sep 01 '21 at 15:07
  • In your example you are working with an ["older version/syntax"](https://stackoverflow.com/questions/12339323/difference-between-findall-and-find-all-in-beautifulsoup); to read more about `find_all()`/`find()`, [start here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#calling-a-tag-is-like-calling-find-all) – HedgeHog Sep 01 '21 at 15:21

This will normalise the text and hopefully give you what you're looking for:

import urllib.request
from bs4 import BeautifulSoup
wiki = "https://en.wikipedia.org/wiki/List_of_Category_5_Atlantic_hurricanes"
page = urllib.request.urlopen(wiki)
soup = BeautifulSoup(page, 'lxml')
# The class of the list in wikipedia
table = soup.find('table', class_="wikitable sortable")

Data = [[] for _ in range(9)]  # I intend to turn this into a DataFrame
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 9:  # The start and end don't include a <td> tag
        for i, cell in enumerate(cells):
            Data[i].append(cell.text.strip().replace('"', ''))
print(Data)
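Since the question mentions turning Data into a DataFrame, here is a minimal sketch of that last step (the column names are assumptions based on the table headers, and only two of the nine columns are shown for brevity):

```python
import pandas as pd

# Stand-in for two of the nine per-column lists built by the loop above.
Data = [['Cuba', 'San Felipe II'], ['90', '4000']]
columns = ['Name', 'Deaths']  # assumed header names

# Each inner list becomes one DataFrame column.
df = pd.DataFrame(dict(zip(columns, Data)))
```

If you want real column names, you can harvest them from the table's th cells instead of hard-coding them.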