I'm a beginner and this is my first question on the forum. As the title says, my goal is to scrape the links from only one column of the table on this Wikipedia page: https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain

I've already read several questions on this forum (especially this one: How do I extract text data in first column from Wikipedia table?), but none of them seems to answer my question. From what I understand, using a DataFrame is not a solution, since it is essentially a copy/paste of the table's text, whereas I want the links.

Here is my code so far. It collects every link in the table, not just those of one column:

import requests
from bs4 import BeautifulSoup as bs

res = requests.get("https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain")
soup = bs(res.text, "html.parser")

# First table on the page that has the "wikitable" class
table = soup.find("table", "wikitable")

# Every link in that table, whatever its column
links = table.find_all("a")

# Map each link's text to its href
communes = {}
for link in links:
    url = link.get("href", "")
    communes[link.text.strip()] = url
print(communes)

Thanks in advance for your answers!

2 Answers

To scrape a specific column, you can use the nth-of-type(n) CSS selector. CSS selectors are passed to the select() method rather than to find_all().

For example, to scrape only the sixth column, select the sixth <td> of each row using soup.select("td:nth-of-type(6)").

Here's an example of how to print all the links from only the fifth column:

import requests
from bs4 import BeautifulSoup


BASE_URL = "https://fr.wikipedia.org"
URL = "https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain"

soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# Find all `a` tags under the fifth `td` of each row, i.e. the fifth column
for tag in soup.select("td:nth-of-type(5) a"):
    print(BASE_URL + tag["href"])

Output:

https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-1
https://fr.wikipedia.org/wiki/Canton_de_Bourg-en-Bresse-2
https://fr.wikipedia.org/wiki/Canton_d%27Amb%C3%A9rieu-en-Bugey
https://fr.wikipedia.org/wiki/Canton_de_Villars-les-Dombes
https://fr.wikipedia.org/wiki/Canton_de_Belley
...
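
The selector above matches the fifth td of every table on the page. As a minimal offline sketch (not part of the original answer), the same idea can be scoped to the table.wikitable element used in the question and checked without network access. The HTML string and the column_links helper below are made up for illustration, and the real page has more columns:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://fr.wikipedia.org"

# Hypothetical, heavily simplified stand-in for the real page.
HTML = """
<table class="wikitable">
  <tr><th>Nom</th><th>Code</th><th>Canton</th></tr>
  <tr>
    <td><a href="/wiki/Amb%C3%A9rieu-en-Bugey">Ambérieu-en-Bugey</a></td>
    <td>x</td>
    <td><a href="/wiki/Canton_d%27Amb%C3%A9rieu-en-Bugey">Ambérieu-en-Bugey</a></td>
  </tr>
  <tr>
    <td><a href="/wiki/Belley">Belley</a></td>
    <td>x</td>
    <td><a href="/wiki/Canton_de_Belley">Belley</a></td>
  </tr>
</table>
<table class="other">
  <tr><td>x</td><td>y</td><td><a href="/wiki/Unrelated">Unrelated</a></td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")


def column_links(soup, column, table_selector="table.wikitable"):
    """Return absolute URLs of the links found in the given 1-based column.

    Hypothetical helper: `table_selector` limits the search to one kind of
    table so that cells from unrelated tables are not picked up.
    """
    selector = f"{table_selector} td:nth-of-type({column}) a[href]"
    # urljoin resolves relative hrefs such as "/wiki/..." against BASE_URL
    return [urljoin(BASE_URL, a["href"]) for a in soup.select(selector)]


canton_links = column_links(soup, 3)
print(canton_links)
```

Note that nth-of-type counts only sibling elements with the same tag name, so if a row starts with a th cell, the td numbering shifts by one for that row.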
MendelG

If you want the first column, which contains the communes, you can also rely on the fact that its cells are left-aligned and use an attribute=value selector:

commune_links = ['https://fr.wikipedia.org' + i['href'] for i in soup.select('[style="text-align:left;"] a')]
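
An attribute=value selector compares the whole style string and is not tied to any particular table, so it also matches other elements on the page that carry the same style. The sketch below illustrates this on a made-up HTML fragment (not the real page) and shows one way to narrow the match by prefixing the table.wikitable selector from the question:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one left-aligned block outside the table,
# plus a table whose first column is left-aligned.
HTML = """
<div style="text-align:left;"><a href="/wiki/Aide">Aide</a></div>
<table class="wikitable">
  <tr>
    <td style="text-align:left;"><a href="/wiki/Belley">Belley</a></td>
    <td><a href="/wiki/Canton_de_Belley">Belley</a></td>
  </tr>
  <tr>
    <td style="text-align:left;"><a href="/wiki/Gex">Gex</a></td>
    <td><a href="/wiki/Canton_de_Belley">Belley</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Matches every left-aligned element in the document, table or not
unscoped = [a["href"] for a in soup.select('[style="text-align:left;"] a')]

# Same selector, restricted to descendants of the wikitable
scoped = [
    a["href"]
    for a in soup.select('table.wikitable [style="text-align:left;"] a')
]

print(unscoped)
print(scoped)
```

Because the comparison is an exact string match, a cell written as style="text-align: left" (with a space, or without the trailing semicolon) would not be selected.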
QHarr
  • Do you know why it works with this link https://fr.wikipedia.org/wiki/Liste_des_communes_de_l%27Ain but not with this one https://fr.wikipedia.org/wiki/Liste_des_communes_du_Pas-de-Calais? – Anthony SULIO Mar 18 '21 at 13:58
  • For that one you need to specify the table as well: `commune_links = ['https://fr.wikipedia.org' + i['href'] for i in soup.select('.titre-en-couleur [style="text-align:left;"] a')]` – QHarr Mar 18 '21 at 14:26
  • Hi @QHarr, I have another question. Do you know why I can get all the links with `commune_links = [i['href'] for i in soup.select('.titre-en-couleur a')]` but I can't get the population figures with `commune_links = [i['data-sort-value'] for i in soup.select('.titre-en-couleur td')]`? – Anthony SULIO Mar 20 '21 at 13:07
  • You need `soup.select(".titre-en-couleur td[data-sort-value]")` – QHarr Mar 21 '21 at 03:52
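
A likely explanation for the last exchange, offered here as a sketch rather than as a statement from the thread: indexing a tag with td['data-sort-value'] raises a KeyError as soon as one selected cell lacks that attribute, whereas the td[data-sort-value] selector only returns cells that have it. The HTML below is invented for illustration, with placeholder commune names and population values:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: only the second cell of each row carries
# a data-sort-value attribute. Names and numbers are placeholders.
HTML = """
<table class="wikitable titre-en-couleur">
  <tr>
    <td><a href="/wiki/Commune_A">Commune A</a></td>
    <td data-sort-value="1000">1 000</td>
  </tr>
  <tr>
    <td><a href="/wiki/Commune_B">Commune B</a></td>
    <td data-sort-value="2000">2 000</td>
  </tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Selecting every td and indexing fails on the first cell without the attribute
raised_key_error = False
try:
    [td["data-sort-value"] for td in soup.select(".titre-en-couleur td")]
except KeyError:
    raised_key_error = True

# Attribute-presence selector: only cells that define data-sort-value
populations = [
    int(td["data-sort-value"])
    for td in soup.select(".titre-en-couleur td[data-sort-value]")
]

# Alternative: .get() returns None instead of raising when the attribute is missing
lenient = [td.get("data-sort-value") for td in soup.select(".titre-en-couleur td")]

print(raised_key_error, populations, lenient)
```

On the real page, other columns may also define data-sort-value, so combining this selector with a column selector such as nth-of-type may be needed.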