How do I scrape the links from only one column of a Wikipedia table with Python?

Question

I'm a beginner and this is my first question on the forum. As the title says, my goal is to scrape the links from only one column of the table on this wiki page: https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain

I've already read several questions on this forum (especially this one: How do I extract text data in first column from Wikipedia table?), but none of them seems to answer my question. From what I understand, using a DataFrame is not a solution, since it is essentially a copy/paste of the table's text, whereas I want to get the links.

Here is my code so far:

import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain")
soup = bs(res.text, "html.parser")

table = soup.find("table", "wikitable")
links = table.find_all("a")  # every link in the table, whatever the column

communes = {}
for link in links:
    url = link.get("href", "")
    communes[link.text.strip()] = url
print(communes)

Thanks in advance for your answers!

What column specifically do you want to scrape? – MendelG, Mar 17 '21 at 14:27
Only the first column, to get the links of the cities. – Anthony SULIO, Mar 17 '21 at 17:58

MendelG · Accepted Answer · 2021-03-17T15:02:21.153

To scrape a specific column, you can use the :nth-of-type(n) CSS selector. To use a CSS selector, call the select() method instead of find_all().

For example, to scrape only the sixth column, select the sixth <td> of each row with soup.select("td:nth-of-type(6)").

Here's an example of how to print all the links from only the fifth column:

import requests
from bs4 import BeautifulSoup


BASE_URL = "https://fr.wikipedia.org"
URL = "https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# Find all `a` tags under the fifth `td` of its type in each row, i.e. the fifth column
for tag in soup.select("td:nth-of-type(5) a"):
    print(BASE_URL + tag["href"])

Output:

https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-1
https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-2
https://fr.wikipedia.org/wiki/Canton_d%27Amb%C3%A9rieu-en-Bugey
https://fr.wikipedia.org/wiki/Canton_de_Villars-les-Dombes
https://fr.wikipedia.org/wiki/Canton_de_Belley
...
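
For the first column, which is what the question asks about, the same selector can be scoped to the table so that cells from other tables on the page are not picked up. Here is a minimal offline sketch; the inline HTML is a made-up stand-in for the page (not its real markup), and `urljoin` from the standard library is swapped in for string concatenation as a design choice:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://fr.wikipedia.org"

# Hypothetical, simplified markup standing in for the real page.
HTML = """
<table class="wikitable">
  <tr><th>Nom</th><th>Canton</th></tr>
  <tr><td><a href="/wiki/Amb%C3%A9rieu-en-Bugey">Ambérieu-en-Bugey</a></td>
      <td><a href="/wiki/Canton_d%27Amb%C3%A9rieu-en-Bugey">Ambérieu-en-Bugey</a></td></tr>
  <tr><td><a href="/wiki/Belley">Belley</a></td>
      <td><a href="/wiki/Canton_de_Belley">Belley</a></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# First <td> of each row, restricted to tables with class "wikitable".
first_column_links = [
    urljoin(BASE_URL, a["href"])
    for a in soup.select("table.wikitable td:nth-of-type(1) a")
]
print(first_column_links)
```

Note that `:nth-of-type` counts only siblings with the same tag name, so if a page marks up its first column as `<th>` cells, `td:nth-of-type(1)` would point at the second visible column instead.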

I upvoted to say "this answer is useful", but it doesn't count yet since I have less than 15 reputation. – Anthony SULIO, Mar 17 '21 at 20:21

score 1 · Answer 2 · answered Mar 18 '21 at 01:00


If you want the first column, containing the communes, you can also take advantage of the fact that its cells are left-aligned and match them with an attribute = value selector:

commune_links = ['https://fr.wikipedia.org' + i['href'] for i in soup.select('[style="text-align:left;"] a')]
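
One caveat with `[style="text-align:left;"]` is that it matches only when the attribute value is exactly that string. The small offline sketch below uses invented HTML to show the difference; the `*=` substring variant is added here for illustration and is not part of the original answer:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two left-aligned cells written with different spacing.
HTML = """
<table>
  <tr><td style="text-align:left;"><a href="/wiki/Belley">Belley</a></td></tr>
  <tr><td style="text-align: left"><a href="/wiki/Gex">Gex</a></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Exact match: the attribute value must equal the string character for character.
exact = [a["href"] for a in soup.select('[style="text-align:left;"] a')]

# Substring match on "left": looser, so it tolerates spacing differences.
loose = [a["href"] for a in soup.select('[style*="left"] a')]

print(exact)
print(loose)
```

The looser form is more tolerant of markup changes, but it can also match unrelated elements whose style happens to contain "left", so it is worth scoping it to the table.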

answered Mar 18 '21 at 01:00 by QHarr

And do you know why it works with this link https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain and not this one https://fr.wikipedia.org/wiki/Liste_des_communes_du_Pas-de-Calais ? – Anthony SULIO Mar 18 '21 at 13:58
For that one you need to specify the table as well `commune_links = ['https://fr.wikipedia.org' + i['href'] for i in soup.select('.titre-en-couleur [style="text-align:left;"] a')]` – QHarr Mar 18 '21 at 14:26
Hi @QHarr, I have another question. Do you know why I can get all the links with `commune_links = [i['href'] for i in soup.select('.titre-en-couleur a')]` but can't get the population figures with `commune_links = [i['data-sort-value'] for i in soup.select('.titre-en-couleur td')]`? – Anthony SULIO Mar 20 '21 at 13:07

You need `soup.select(".titre-en-couleur td[data-sort-value]")`. – QHarr Mar 21 '21 at 03:52
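
A likely explanation for the difference: `.titre-en-couleur td` returns every cell, and indexing a tag with `['data-sort-value']` raises `KeyError` on the first cell that lacks the attribute, whereas `td[data-sort-value]` keeps only the cells that have it. Here is a sketch with invented markup (the class name comes from the comments above; the values are placeholders, not real population data):

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the numbers are placeholders, not real population data.
HTML = """
<table class="titre-en-couleur">
  <tr><td><a href="/wiki/Arras">Arras</a></td><td data-sort-value="100">100</td></tr>
  <tr><td><a href="/wiki/Lens">Lens</a></td><td data-sort-value="200">200</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Every <td>: the first one has no data-sort-value, so indexing it fails.
try:
    values = [td["data-sort-value"] for td in soup.select(".titre-en-couleur td")]
except KeyError:
    values = None

# Only <td> elements that carry the attribute.
sort_values = [
    td["data-sort-value"]
    for td in soup.select(".titre-en-couleur td[data-sort-value]")
]
print(values, sort_values)
```

An alternative is `td.get("data-sort-value")`, which returns `None` for cells without the attribute instead of raising.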
