
I start from this page https://www.cnrtl.fr/portailindex/LEXI/TLFI/A and want to scrape all the following pages until I reach the bottom of the index.

For each letter A to Z, the URLs of the next pages (as shown in the browser) are https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/<index>, where the index increments by 80 each time. For instance, the first next page is https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80. My first idea was to build the URLs based on this rule and fetch them with urllib. However, when I implement this in Python,

import urllib.request
from bs4 import BeautifulSoup

res = urllib.request.urlopen(url)
soup = BeautifulSoup(res, "lxml")

it seems that I always stay on the first page https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/.

A second idea is to get the next page from the next-page button; an example of such a button is

<a href="/portailindex/LEXI/TLFI/B/480"><img src="/images/portail/right.gif" title="Page suivante"
           border="0" width="32" height="32" alt="" /></a>

but all I get is again /portailindex/LEXI/TLFI/B/480, and calling urllib.request on this relative path does not advance to the next page.
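For what it's worth, an href like /portailindex/LEXI/TLFI/B/480 is an absolute path relative to the site root, not a full URL, so it has to be resolved against the current page's URL before it can be fetched. A minimal sketch with `urllib.parse.urljoin` (the example URLs are taken from the snippets above):

```python
from urllib.parse import urljoin

# The next-page button's href is an absolute path, not a full URL;
# join it with the current page's URL before passing it to urlopen.
current_page = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/B/400"
next_href = "/portailindex/LEXI/TLFI/B/480"

next_url = urljoin(current_page, next_href)
print(next_url)  # https://www.cnrtl.fr/portailindex/LEXI/TLFI/B/480
```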


So, why does https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/80 work in the browser while urllib.request brings me back to https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/ ?

Is there any elegant way to go from one page to the next here until it finishes cleanly?

VLAZ
kiriloff

3 Answers


This seems to do it:

import re
import string
import urllib.error
import urllib.request

from bs4 import BeautifulSoup

dictionary = []

# The word entries link to their definition pages; this assumes hrefs
# like /definition/... — adjust the pattern if the site layout differs.
regex = re.compile(r"^/definition/")

def get_words_in_page(url):
    res = urllib.request.urlopen(url)
    soup = BeautifulSoup(res, "lxml")
    for w in soup.findAll("a", {"href": regex}):
        dictionary.append(w.string)

base_url = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/"

for l in string.ascii_uppercase:
    letter_url = base_url + l  # keep base_url unchanged between letters
    get_words_in_page(letter_url)
    next_index = 0
    while True:
        next_index += 80
        url = letter_url + "/" + str(next_index)
        try:
            res = urllib.request.urlopen(url)
        except (urllib.error.HTTPError, ValueError):
            break
        get_words_in_page(url)
kiriloff

Not very sure what's going on, but something like the following worked well for me recently:

Python 3.10.2 on Windows 10. The following code is from the context of a larger program.

from bs4 import BeautifulSoup as Soup
from urllib import request

START = 1
END = 82

BASE_URL = "https://www.cnrtl.fr/portailindex/LEXI/TLFI/A/*"

def pull(url: str) -> Soup:
    my_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}

    my_request = request.Request(url, headers=my_headers)
    html_text = request.urlopen(my_request).read()

    return Soup(html_text, 'html.parser')

def main():
    for i in range(START, END + 1):
        print(f"\nStarting page {i}...")
        url = BASE_URL.replace("*", str(i))

        soup = pull(url)

if __name__ == "__main__":
    main()

Could it be that you need headers? Source

eccentricOrange
  • I can't find something similar in the source code of my web page. Would you be able to check the URLs I gave and see what's up? Feel free to start a chat on Stack Overflow; I don't know how to. – kiriloff Feb 23 '22 at 13:58
  • My program here seems to work for your URL too; I checked pages 1, 17, 79, and 80. I've updated my example to work for your URL. Upon running `main()`, the `soup` variable inside the loop is what you want (I think). To help further, I'd need to know exactly what you're scraping and what you're trying to accomplish. – eccentricOrange Feb 23 '22 at 14:06

Just iterate over the letter hrefs, and for each letter use the href of the <a> that holds the next-page arrow to iterate over all sub-pages.

In my opinion this is more generic than the approach that counts up the index numbers.

Example

import time  # needed for time.sleep below

from bs4 import BeautifulSoup
import requests

baseUrl = 'https://www.cnrtl.fr'
response = requests.get('https://www.cnrtl.fr/portailindex/LEXI/TLFI/A')
soup = BeautifulSoup(response.content, 'html.parser')

data = []

for url in soup.select('table.letterHeader a'):

    while True:
        response = requests.get(baseUrl+url['href'])
        soup = BeautifulSoup(response.content, 'html.parser')

        data.extend([x.text for x in soup.select('table.hometab a')])

        if (a := soup.select_one('a:has(img[title="Page suivante"])')):
            url = a
        else:
            break

        time.sleep(2)

Output

['à', 'à-plat', 'abaissement', 'abas', 'a', 'a-raciste', 'abaisser', 'abasie', 'a b c', 'à-venir', 'abaisseur', 'abasourdir', 'à contre-lumière', 'aalénien', 'abajoue', 'abasourdissant', "à l'envers", 'aaronide', 'abalober', 'abasourdissement', 'à la bonne franquette', 'ab hoc et ab hac', 'abalone', 'abat', 'à muche-pot', 'ab intestat', 'abalourdir', 'abat-chauvée', 'à musse-pot', 'ab irato', 'abalourdissement', 'abat-faim', 'à pic', 'ab ovo', 'abandon', 'abat-feuille', 'à posteriori', 'aba', 'abandonnataire', 'abat-flanc', 'à priori', 'abaca', 'abandonné', 'abat-foin', 'à tire-larigot', 'abaddir', 'abandonnée', 'abat-joue', 'à vau', 'abadie', 'abandonnement', 'abat-jour', 'à vau-de-route', 'abadis', 'abandonnément', 'abat-relui', 'à vau-le-feu', 'abaissable', 'abandonner', 'abat-reluit', 'à-bas', 'abaissant', 'abandonneur', 'abat-son', 'à-compte', 'abaisse', 'abandonneuse', 'abat-vent', 'a-humain', 'abaissé', 'abaque', 'abat-voix', 'a-mi-la', 'abaisse-langue', 'abarticulaire', 'abatage', 'à-pic', 'abaissée', 'abarticulation', 'abâtardi', 'abâtardir', 'abbatial', 'abdominal', 'abécé', 'abâtardissement', 'abbatiale', 'abdominale', 'abécédaire', 'abatée', 'abbatiat', 'abdominien', 'abécédé', 'abatis', 'abbattre', 'abdominienne', 'abéchement', 'abatre', 'abbaye', 'abdomino-coraco-huméral', 'abécher', 'abattable', 'abbé', 'abdomino-coraco-humérale', 'abecquage', 'abattage', 'abbesse', 'abdomino-génital', 'abecquement', 'abattant', 'abbevillien', 'abdomino-génitale', 'abecquer', 'abattée', 'abbevillienne', 'abdomino-guttural', 'abecqueuse', 'abattement', 'abbevillois', 'abdomino-gutturale', 'abée', 'abatteur', 'abbevilloise', 'abdomino-huméral', 'abeillage', 'abatteuse', 'abcéder', 'abdomino-humérale', 'abeille', 'abattis', 'abcès', 'abdomino-périnéal', 'abeillé', 'abattoir', 'abdalas', 'abdomino-scrotal', 'abeiller', 'abattre', 'abdéritain', 'abdomino-thoracique', 'abeillier', 'abattu', 'abdéritaine', 'abdomino-utérotomie', 'abeillon', 'abattue', 'abdicataire', 
'abdominoscopie', 'abélien', 'abatture', 'abdication', 'abdominoscopique', 'abéquage', 'abax', 'abdiquer', 'abducteur', 'abéquer', 'abbadie', 'abdomen', 'abduction', 'abéqueuse', 'aber', 'abiétine', 'abjurer', 'aboi', 'aberrance', 'abiétiné', 'ablatif', 'aboiement', 'aberrant', 'abiétinée', 'ablation', 'aboilage', 'aberration', 'abiétique', 'ablativo', 'abolir', 'aberrer', 'abigaïl', 'able', 'abolissable', 'aberrographe', 'abigéat', 'ablégat', 'abolissement', 'aberroscope', 'abigotir', 'ablégation', 'abolitif', 'abessif', 'abîme', 'abléphare', 'abolition', 'abêtifier', 'abîmé', 'ablépharie', 'abolitionnisme', 'abêtir', 'abîmement', 'ablépharoplastique', 'abolitionniste', 'abêtissant', 'abîmer', 'ableret', 'aboma', 'abêtissement', 'abiogenèse', 'ablet', 'abominable', 'abêtissoir', 'abiose', 'ablette', 'abominablement', 'abhorrable', 'abiotique', 'ablier', 'abomination', 'abhorré', 'abject', 'abluant', 'abominer', 'abhorrer', 'abjectement', 'abluante', 'abondamment', 'abicher', 'abjection', 'abluer', 'abondance', 'abies', 'abjurateur', 'ablution', 'abondant', 'abiétacée', 'abjuration', 'ablutionner', 'abonder', 'abiétin', 'abjuratoire', 'abnégation', 'abonnable',...]
HedgeHog