Scrape tables from Wikipedia using Python?

Question

I am trying to scrape table data from this Wikipedia page (https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal). I tried pandas' pd.read_html, but it doesn't work for the table I want (Confirmed COVID-19 cases in Nepal by district).

I also tried using BeautifulSoup together with pandas, but that doesn't work either:

url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal'
r = requests.get(url)
soup = BeautifulSoup(r.text,'html.parser')
table = soup.find('table', {'class': 'wikitable'})
dfs=pd.read_html(table)
dfs[0]

Lomtrur · Accepted Answer · 2020-04-06T08:55:55.923

import io

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
# First table on the page with the CSS class "wikitable"
table = soup.find('table', {'class': 'wikitable'})
# read_html needs HTML text, not a bs4 Tag, so convert with str().
# replace() turns malformed span values such as rowspan="2;" into "2".
html = str(table).replace("2;", "2")
# Newer pandas releases deprecate literal HTML strings, so wrap in StringIO.
dfs = pd.read_html(io.StringIO(html))
print(dfs[0])

This works. You need to convert the table to a string for read_html to function properly.

For some reason, the rowspan and colspan attribute values show up as "2;" instead of "2". pd.read_html() cannot parse that, and I couldn't find a cleaner fix, so I just use .replace().

In theory, passing the URL straight to pd.read_html should accomplish the same thing as the BeautifulSoup approach with less code, but it runs into the same rowspan issue:

dfs = pd.read_html("https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Nepal", flavor="lxml")
print(dfs[0])  # whatever the index of the table is

This may be a bug in read_html (pandas version 1.0.3).
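
The blanket .replace("2;", "2") would also rewrite any cell text that happens to contain "2;", and it only handles the value 2. A narrower alternative is to clean only the rowspan and colspan attributes with a regular expression before handing the HTML to pd.read_html. The following is a minimal sketch using only the standard library; the function name clean_span_attributes and the sample HTML are made up for illustration, and it assumes the attribute values are quoted and begin with digits.

```python
import re

# Matches rowspan="2;" or colspan='3;' and captures the attribute name,
# the quote character, and the leading digits of the value.
_SPAN_ATTR = re.compile(
    r'\b(rowspan|colspan)=(["\'])\s*(\d+)[^"\']*\2',
    re.IGNORECASE,
)


def clean_span_attributes(html: str) -> str:
    """Return html with rowspan/colspan values cut down to their leading integer."""
    return _SPAN_ATTR.sub(r'\1=\2\3\2', html)


# Hypothetical sample row: one malformed rowspan, one valid colspan,
# and cell text containing "2;" that should stay untouched.
sample = '<td rowspan="2;">Kathmandu</td><td colspan="3">12; 13</td>'
cleaned = clean_span_attributes(sample)
print(cleaned)
```

In the answer's code, str(table).replace("2;", "2") could then be swapped for clean_span_attributes(str(table)), leaving the rest unchanged.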
