How to download a table from Wikipedia as a pandas DataFrame in Python?

Question

I would like to download the table from the page below (on Polish Wikisource) as a pandas DataFrame in JupyterLab: https://pl.wikisource.org/wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT

There is only one table, and it is not complicated. How can I do that in Python?

score 1 · Answer 1 · answered Aug 19 '21 at 10:52

Type 1:

Just use the pandas function pd.read_html. It returns a list of DataFrames, one per table found on the page, and you pick the one you want by index:

import pandas as pd
res=pd.read_html("https://pl.wikisource.org/wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT")
df=res[3]
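
A note on picking the table: res[3] depends on the table's position in the page, which can shift when the page is edited. pd.read_html also accepts a match argument that keeps only the tables containing the given text. Below is a minimal sketch of that idea. It uses a small inline HTML snippet (made up for illustration, with the two sample rows from the output in this answer) instead of the live page so it runs offline, and it assumes an HTML parser such as lxml is installed. Whether "TERYT" appears in only one table on the real page is an assumption you should check.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the downloaded page: one navigation table
# and one data table. The real page has more content than this.
html = """
<table><tr><th>Menu</th></tr><tr><td>Home</td></tr></table>
<table>
  <tr><th>Nazwa powiatu</th><th>TERYT</th></tr>
  <tr><td>aleksandrowski</td><td>04 01</td></tr>
  <tr><td>augustowski</td><td>20 01</td></tr>
</table>
"""

# match keeps only the tables whose text matches the given string/regex,
# so the result no longer depends on the table's position in the page.
tables = pd.read_html(StringIO(html), match="TERYT")
df = tables[0]
print(df)
```

With the live page you would pass the URL instead of StringIO(html) and keep the same match argument.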

Type 2:

You can also use the requests and bs4 modules to find the table yourself and then pass its HTML to pandas:

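from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

res = requests.get("https://pl.wikisource.org/wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT")
soup = BeautifulSoup(res.text, "html.parser")

# The fourth <table> element on the page holds the data
data = soup.find_all("table")[3]
# read_html returns a list of DataFrames, so take the first one; StringIO is
# used because newer pandas versions deprecate passing a raw HTML string
df = pd.read_html(StringIO(str(data)))[0]
df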

Output:

    Nazwa powiatu   TERYT
0   aleksandrowski  04 01
1   augustowski     20 01
.   .....          ..

nadirhan · Answer 2 · 2021-08-19T10:34:39.067

0

You need to fetch the HTML with the requests library and then search it for the right tags with a parsing library (I use BeautifulSoup). The code looks something like this:

import requests
from bs4 import BeautifulSoup

URL = "https://pl.wikisource.org/wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT"
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
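# Collect every <td> cell of the data table inside the main content area
results = soup.find("div", {"id": "mw-content-text"}).find("table", {"border": 1}).find_all("td")
# Cells alternate: even indexes hold county names, odd indexes hold TERYT codes
namelist = [cell.text.strip() for cell in results[0::2]]
numberlist = [cell.text.strip() for cell in results[1::2]]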

Each cell's text is a string, so namelist and numberlist end up as two lists of strings: the county names and the TERYT codes. From there it is very simple to convert them to a pandas DataFrame.
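
For that last step, a minimal sketch of turning two such lists into a DataFrame. The sample values are the two rows shown in the other answer's output (with a trailing newline added to one name as an assumption about what raw cell text can look like); on the real page the lists would come from the scraping code above. The column names are taken from the table header.

```python
import pandas as pd

# Sample values only; in practice these lists come from the scraped <td> cells.
namelist = ["aleksandrowski\n", "augustowski"]
numberlist = ["04 01", "20 01"]

# Build the DataFrame column by column from the two parallel lists.
df = pd.DataFrame({"Nazwa powiatu": namelist, "TERYT": numberlist})

# Strip stray whitespace as a precaution, since raw cell text often keeps newlines.
df["Nazwa powiatu"] = df["Nazwa powiatu"].str.strip()
df["TERYT"] = df["TERYT"].str.strip()
print(df)
```

Keeping the TERYT codes as strings preserves the leading zeros in values such as "04 01".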

edited Aug 19 '21 at 10:34

answered Aug 19 '21 at 10:04


I get the error below when using your code: ConnectionError: HTTPSConnectionPool(host='pl.wikisource.org', port=443): Max retries exceeded with url: /wiki/Polskie_powiaty_wed%C5%82ug_kodu_TERYT (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 101] Network is unreachable')) – dingaro Aug 19 '21 at 10:14
Could you try the answers to this question: https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url-in-requests – nadirhan Aug 19 '21 at 10:18
Hmm, there are many answers there. Could you pick the best one and modify your answer? I am not able to change it myself. – dingaro Aug 19 '21 at 10:24
I edited the code and it works well now. But I can't fix your error; you will have to fix it yourself. – nadirhan Aug 19 '21 at 10:35
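
On the ConnectionError in the comments: "[Errno 101] Network is unreachable" usually points at the machine running the code (no internet access, or a firewall or proxy in the way) rather than at the scraping code, so the code cannot fix it. What the code can do is fail gracefully. A minimal sketch, assuming that is the goal; fetch_html is a hypothetical helper name, not part of requests:

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page HTML as text, or None if the request fails.

    fetch_html is a hypothetical helper written for this example.
    """
    try:
        # timeout stops the call from hanging forever on a dead connection
        response = requests.get(url, timeout=timeout)
        # Turn HTTP error statuses (4xx/5xx) into exceptions as well
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # RequestException is the base class of ConnectionError, Timeout, etc.
        print(f"Request failed: {exc}")
        return None
    return response.text
```

You would then call fetch_html with the page URL and only go on to BeautifulSoup or pd.read_html when the result is not None.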
