I want to parse some info from a website that has data spread across several pages.

The problem is I don't know how many pages there are. There might be 2, but there might also be 4, or even just one page.

How can I loop over pages when I don't know how many pages there will be?

I do, however, know the URL pattern, which looks something like the code below.

Also, the page names are not plain numbers: it's 'pe2' for page 2, 'pe4' for page 3, and so on, so I can't just loop over range(number).

This is the dummy code for the loop I am trying to fix:

import requests
from bs4 import BeautifulSoup

pages = ['', 'pe2', 'pe4', 'pe6', 'pe8']

for i in pages:
    url = "http://www.website.com/somecode/dummy?page={}".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    # rest of the scraping code

1 Answer

You can use a while loop that stops running when it encounters an exception.

Code:

from bs4 import BeautifulSoup
from time import sleep
import requests

i = 0
while True:
    try:
        if i == 0:
            # first page has no 'pe' suffix in its URL
            url = "http://www.website.com/somecode/dummy?page="
        else:
            url = "http://www.website.com/somecode/dummy?page=pe{}".format(i)
        r = requests.get(url)
        r.raise_for_status()  # raise on a 4xx/5xx response so the loop can stop
        soup = BeautifulSoup(r.content, 'html.parser')

        # print page url
        print(url)

        # rest of the scraping code

        # don't overwhelm the website
        sleep(2)

        # increase page number
        i += 2
    except Exception:
        break

Output:

http://www.website.com/somecode/dummy?page=
http://www.website.com/somecode/dummy?page=pe2
http://www.website.com/somecode/dummy?page=pe4
http://www.website.com/somecode/dummy?page=pe6
http://www.website.com/somecode/dummy?page=pe8
...
... and so on, until it encounters an exception.
  • Cool, I think this almost solves my problem, except the first page has no "pe" in its URL. Then the next one is pe2, and it grows by 2 for each following page. Do you have an idea how to solve that without creating a list of lots of pe*? – Alex T Apr 04 '17 at 16:47
  • @AlexT check the edited answer. You can achieve this by using an `if/else` clause while incrementing the value of the variable `i` by `2` in each iteration. – dot.Py Apr 04 '17 at 16:52
  • Hmm, somehow it's not stopping after going past the last page that exists. How come? – Alex T Apr 04 '17 at 17:24
  • Oh, I think I see now: if I request a higher page number than is normally selectable on the site, the page still exists, just with a single data entry... What would you recommend doing in that case? – Alex T Apr 04 '17 at 17:27
  • Hmm... maybe you can try to fetch some info that exists in the previous "ok" pages. A `title` that exists in the pages that are "ok" and doesn't exist in the "not-ok" pages, for example (see the sketch after these comments)... – dot.Py Apr 04 '17 at 17:31
  • Okay, I figured that out partly; however, there is a problem with sleep(2): when I use it, the code kind of stops and prints just the first URL. – Alex T Apr 04 '17 at 18:26
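
A minimal sketch of that marker idea, assuming the valid pages contain some element that the empty ones lack. The div with class 'results' below is a hypothetical placeholder, not taken from the real site; inspect the actual HTML to pick a marker that only pages with real data have:

from bs4 import BeautifulSoup
from time import sleep
import requests

i = 0
while True:
    # first page has no 'pe' suffix; later pages are pe2, pe4, ...
    suffix = "pe{}".format(i) if i else ""
    url = "http://www.website.com/somecode/dummy?page={}".format(suffix)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')

    # stop once the marker element is missing -- 'results' is an
    # assumed class name standing in for whatever the real pages use
    if soup.find('div', class_='results') is None:
        break

    # rest of the scraping code

    sleep(2)  # be polite to the server
    i += 2

This avoids relying on the server returning an error for out-of-range pages, which is exactly the problem described in the comments above.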