
I created a script to scrape some data from this website: https://www.bilansgratuits.fr/

My script visits all the sector links like this one: https://www.bilansgratuits.fr/secteurs/agriculture,a.html

and then follows all the links inside those pages, like this one: https://www.bilansgratuits.fr/classement/0111Z/default.html

and scrapes the name, the sector and the sub-sector.

Here's my script:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

url = 'https://www.bilansgratuits.fr/'

links1 = []

results = requests.get(url)    

soup = BeautifulSoup(results.text, "html.parser")

links1 = [a['href']  for a in soup.find("div", {"class": "container_rss blocSecteursActivites"}).find_all('a', href=True)]

secteur = [a.text for a in soup.find("div", {"class": "container_rss blocSecteursActivites"}).find_all('a', href=True)]

links1.pop()
secteur.pop()
    
secteurs = []
soussecteurs = []
names = []
#rankings = []

root_url = 'https://www.bilansgratuits.fr/'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]    

for url, secteur in zip(urls1[:1], secteur[:1]):

    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    links = [a['href']  for a in soup.find("div", {"class": "listeEntreprises"}).find_all('a', href=True)]

    soussecteur = [a.text for a in soup.find("div", {"class": "listeEntreprises"}).find_all('a', href=True)]
  
    root_url = 'https://www.bilansgratuits.fr/'
    urls = [ '{root}{i}'.format(root=root_url, i=i) for i in links ]

    for url, soussecteur in zip(urls, soussecteur):

        results = requests.get(url)

        soup = BeautifulSoup(results.text, "html.parser")

        try:
            name = [a.text for a in soup.find("div", {"class": "donnees"}).find_all('a', href=True)]


            for i in name:
                secteurs.append(secteur)

            for i in name:
                soussecteurs.append(soussecteur)
      
        except:
            name = [a.text for a in soup.find("div", {"class": "listeEntreprises"}).find_all('a', href=True)]

            for i in name:
                secteurs.append(secteur)

            for i in name:
                soussecteurs.append(soussecteur)  
          
        names.append(name)  

for i in range(0,len(names)):    
    rx = re.compile(r'^\s+$')

    names[i] = [item.split() for item in names[i] if not rx.match(item)]    

res = []
for list in names:
    for lis in list:
        res.append(' '.join([w for w in lis]))       

data = pd.DataFrame({
    'names' : res,
    'Secteur' : secteurs,
    "Sous-Secteur" : soussecteurs,
    #"Rankings" : rankings
    })


data.to_csv('dftest.csv', sep=';', index=False, encoding = 'utf_8_sig')

I get an output like this: output

But I would also like to scrape this information: ranking

The ranking is to the left of the company names.

And obtain something like that: output2

When there is no ranking, just add "No classement" to the variable.

But I cannot figure out how to add this to my script.

Any ideas?

Mithos
  • Not related but kinda is: *do not* use that much white-space in your code. Read [this](https://www.python.org/dev/peps/pep-0008/). – baduker Feb 26 '21 at 13:39
  • Sorry baduker, you already told me that a few days ago and I read your link, but I didn't think I had used that much white-space.. – Mithos Feb 26 '21 at 13:56
  • I edited my post, sorry again – Mithos Feb 26 '21 at 13:56

1 Answer


I cleaned up your code a bit while working on it. The trick is to perform everything step by step.

First, locate all the links you want from the main page using `get_main_links`. Then find all the sub-sectors using `get_sub_sectors`. This function has two parts: one that extracts all the sub-sectors and a second that locates all the rankings.

The ranking code is heavily based on:

Note some extra things in my code:

  1. I do not use bare try/except statements, since those would also catch KeyboardInterrupt and many other (useful) errors; instead I catch only the AttributeError that you get when no element is found.

  2. I put all my data in a list of dicts. The reason for this is that `DataFrame` handles this form directly, so I do not need to specify any columns later on.

  3. I use tqdm to print the progress, because the extra requests take a bit more time, and it is nice to see that your program is actually doing something (`pip install tqdm`).
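To illustrate point 2, a list of dicts maps straight onto a `DataFrame` without any column specification (the rows below are taken from the output table, trimmed for the example):

```python
import pandas as pd

# Each dict becomes one row; the keys become the column names,
# so there is no need to declare any columns up front.
rows = [
    {'names': 'LIMAGRAIN EUROPE (63360)', 'Secteur': 'A - Agriculture', 'Rankings': '1'},
    {'names': 'LIMAGRAIN (63360)', 'Secteur': 'A - Agriculture', 'Rankings': '2'},
]
df = pd.DataFrame(rows)
print(df.columns.tolist())  # ['names', 'Secteur', 'Rankings']
print(len(df))              # 2
```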

import logging
from pprint import pprint
from typing import List, Dict

import tqdm
import requests
import pandas as pd

from bs4 import BeautifulSoup


def get_main_links(root: str, soup: BeautifulSoup) -> List[Dict[str, str]]:
    """

        Extract the links and secteur tags.

        :return (dict)
            Dictionary containing the following keys

            - href: (str) relative link path from root website
            - secteur: (str) sector name
            - url: (str) absolute url to get to the page.
    """
    data_sectors = []
    all_links = soup.find('div', {'class': 'container_rss blocSecteursActivites'}).find_all('a', href=True)
    for reference in all_links:  # Limit search by using [:2]
        data_sectors.append({
            'href': reference['href'],
            'secteur': reference.text,
            'url': f"{root}{reference['href']}"
        })
    return data_sectors


def get_sub_sectors(root: str, data_sector: Dict[str, str]):
    """ Retreive the sub sectors (sous-secteur).  """
    data_sub_sectors = []

    for sector in tqdm.tqdm(data_sector):
        result = requests.get(sector['url'])
        soup = BeautifulSoup(result.text, 'html.parser')
        sub_sectors = _extract_sub_sectors(root, soup)

        for sub_sector in tqdm.tqdm(sub_sectors, leave=False):  # Limit search by using [:2]
            result = requests.get(sub_sector['url'])
            soup = BeautifulSoup(result.text, 'html.parser')
            rankings = _extract_sub_sector_rankings(sub_sector['url'], soup)

            for ranking in rankings:
                data_sub_sectors.append({
                    'names': ranking['name'],
                    'Secteur': sector['secteur'],
                    'Sous-Secteur': sub_sector['name'],
                    'Rankings': ranking['rank']
                })
    return data_sub_sectors


def _extract_sub_sectors(root: str, soup: BeautifulSoup):
    data_sub_sectors = []

    try:
        sub_sectors = soup.find("div", {"class": "donnees"}).find_all('a', href=True)
    except AttributeError:
        sub_sectors = soup.find("div", {"class": "listeEntreprises"}).find_all('a', href=True)

    for sub_sector in sub_sectors:
        data_sub_sectors.append({
            'href': sub_sector['href'],
            'name': sub_sector.text,
            'url': f"{root}{sub_sector['href']}"
        })
    return data_sub_sectors


def _extract_sub_sector_rankings(root, soup: BeautifulSoup):
    data_sub_sectors_rankings = []

    try:
        entries = soup.find('div', {'class': 'donnees'}).find_all('tr')
    except AttributeError:
        logging.info(f'Failed extracting: {root}')
        entries = []

    for entry in entries:
        data_sub_sectors_rankings.append({
            'rank': entry.find('td').text,
            'name': entry.find('a').text
        })
    return data_sub_sectors_rankings


if __name__ == '__main__':
    url = 'https://www.bilansgratuits.fr/'
    result = requests.get(url)
    soup = BeautifulSoup(result.text, 'html.parser')

    # Obtain all sector data.
    data = get_main_links(url, soup)
    pprint(data)

    # Obtain all sub sectors
    data = get_sub_sectors(url, data)
    pprint(data)

    df = pd.DataFrame(data)
    print(df.columns)
    print(df.head())
    print(df.iloc[0])
    print(df.iloc[1])

    df.to_csv('dftest.csv', sep=';', index=False, encoding='utf_8_sig')

Output

names                                              ;Secteur                   ;Sous-Secteur                                                                                                          ;Rankings
LIMAGRAIN EUROPE (63360)                           ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;1
LIMAGRAIN (63360)                                  ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;2
TOP SEMENCE (26160)                                ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;3
CTRE SEMENCE UNION COOPERATIVE AGRICOLE (37310)    ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;4
SEMENCES DU SUD (11400)                            ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;5
KWS MOMONT (59246)                                 ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;6
TECHNISEM (49160)                                  ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;7
AGRI-OBTENTIONS (78280)                            ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;8
LS PRODUCTION (75116)                              ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;9
SAS VALFRANCE SEMENCES (60300)                     ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;10
DURANCE HYBRIDES (13610)                           ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;11
STRUBE FRANCE (60190)                              ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;12
ID GRAIN (31330)                                   ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;13
SAGA VEGETAL (33121)                               ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;14
SOC COOP AGRICOLE VALLEE RHONE VALGRAIN (26740)    ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;15
ALLIX SARL (33127)                                 ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;16
PAMPROEUF (79800)                                  ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;17
PERDIGUIER FOURRAGES (84310)                       ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;18
PANAM FRANCE (31340)                               ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;19
HENG SIENG (57260)                                 ;A - Agriculture           ;0111Z - Culture de céréales (à l'exception du riz), de légumineuses et de graines oléagineuses    ;20
HM.CLAUSE (26800)                                  ;A - Agriculture           ;0113Z - Culture de légumes, de melons, de racines et de tubercules                                ;1
VILMORIN-MIKADO (49250)                            ;A - Agriculture           ;0113Z - Culture de légumes, de melons, de racines et de tubercules                                ;2
SARL FERME DE LA MOTTE (41370)                     ;A - Agriculture           ;0113Z - Culture de légumes, de melons, de racines et de tubercules                                ;3
RIJK ZWAAN FRANCE (30390)                          ;A - Agriculture           ;0113Z - Culture de légumes, de melons, de racines et de tubercules                                ;4
LE JARDIN DE RABELAIS (37420)                      ;A - Agriculture           ;0113Z - Culture de légumes, de melons, de racines et de tubercules                                ;5
RENAUD & FILS SARL (17800)                         ;A - Agriculture           ;0113Z - Culture de légumes, de melons, de racines et de tubercules                                ;6

Edit

In order to put `No classify` in the rank or name, you can change `_extract_sub_sector_rankings` as follows:


def _extract_sub_sector_rankings(root, soup: BeautifulSoup):
    data_sub_sectors_rankings = []

    try:
        entries = soup.find('div', {'class': 'donnees'}).find_all('tr')
    except AttributeError:
        print(f"\rFailedExtracting: {root}", )
        return [{'rank': 'No classify', 'name': 'No classify'}]

    for entry in entries:
        data_sub_sectors_rankings.append({
            'rank': entry.find('td').text,
            'name': entry.find('a').text
        })
    return data_sub_sectors_rankings

Before, I used `logging.info`, but the default logging level is WARNING, so those messages did not show up. If you use `logging.warning` instead, you will see all the links that failed. In the original answer those entries were skipped instead of getting the `No classify` value.
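A minimal, standard-library-only illustration of that logging behaviour:

```python
import logging

# The root logger defaults to the WARNING level, so info() calls
# produce no output unless the level is lowered explicitly.
assert logging.getLogger().getEffectiveLevel() == logging.WARNING

logging.info('dropped: below the default WARNING threshold')
logging.warning('shown: meets the default threshold')

# To also see the info() messages, raise the verbosity first:
# logging.basicConfig(level=logging.INFO)
```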

If you would like to give a different ranking or name you can adjust the return value.
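For instance, here is a sketch of a variant that keeps every company name and only marks the rank as missing. It assumes that rank-less pages still list the companies inside a `listeEntreprises` div, as in the question's original script; `extract_rankings` is a hypothetical name that would replace the body of `_extract_sub_sector_rankings`:

```python
from bs4 import BeautifulSoup


def extract_rankings(soup: BeautifulSoup):
    """Return [{'rank': ..., 'name': ...}]; rank is 'No classement' when absent."""
    try:
        entries = soup.find('div', {'class': 'donnees'}).find_all('tr')
    except AttributeError:
        # Assumption: pages without a ranking table still list the
        # companies under a 'listeEntreprises' div (as in the question).
        fallback = soup.find('div', {'class': 'listeEntreprises'})
        links = fallback.find_all('a', href=True) if fallback else []
        return [{'rank': 'No classement', 'name': a.text} for a in links]
    return [{'rank': e.find('td').text, 'name': e.find('a').text} for e in entries]
```

This way every company from a rank-less page still becomes a row, with only the `rank` value replaced.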

Thymen
  • Thanks for your answer, but I would like to keep a structure reasonably similar to mine :) It is fast (~3 minutes to run) and convenient. But thank you a lot anyway!! – Mithos Feb 26 '21 at 15:54
  • I didn't change your structure, I am using your code. The reason it is slower is that you want to get the extra data for each of the subsections, which means that you now have to parse each of them, which you didn't have to do before. Without the ranking code, it takes me about 5 seconds to run this. – Thymen Feb 26 '21 at 17:43
  • Maybe my 3rd note wasn't clear. The whole program takes about 2-3 minutes to run for me, but I added the progress bar so I could see that the program is doing things, since for testing I only needed a bit of data. P.S. If the progress bar is jumping to new lines, you can use `pip install colorama`; this should prevent the jumping around. – Thymen Feb 26 '21 at 17:55
  • I just saw that sometimes some pages don't have any ranking. Your script doesn't seem to deal with that? In the dftest, I didn't see any "No classify". – Mithos Mar 01 '21 at 09:35
  • I edited the answer to provide the `No classify` to the entry. – Thymen Mar 01 '21 at 10:54
  • `Traceback (most recent call last): File "scraping_bilan.py", line 103, in data = get_sub_sectors(url, data) File "scraping_bilan.py", line 47, in get_sub_sectors rankings = _extract_sub_sector_rankings(sub_sector['url'], soup) File "scraping_bilan.py", line 90, in _extract_sub_sector_rankings return data_sub_sectors_rank NameError: name 'data_sub_sectors_rank' is not defined` – Mithos Mar 01 '21 at 12:36
  • That doesn't work, it just puts one row with "no classify" as the company name. I would like to have all the names, but with "No classify" in the `rankings` variable for all the names with no ranking. – Mithos Mar 01 '21 at 14:48
  • The above error is a copying mistake the value should be `data_sub_sectors_rankings `, instead of `data_sub_sectors_rank` (fixed it in the answer). As indicated in the answer, you can change `return [{'rank': 'No classify', 'name': 'No classify'}]` to be what you want. In this case you would have to provide your required name to the `name` key. – Thymen Mar 01 '21 at 18:47
  • What do you mean by `name` key ? – Mithos Mar 02 '21 at 10:58
  • `{'rank': 'No classify', 'name': 'No classify'}` is a dictionary with the keys `rank` and `name`. Change the value of the `name` to what you want it to be. – Thymen Mar 02 '21 at 11:11