
I am trying to extract some information about MTG cards from a webpage with the following program, but I repeatedly retrieve information about the initial page given (InitUrl). The crawler is unable to proceed further. I have started to believe that I am not using the correct URLs, or maybe there is a restriction on using urllib that slipped my attention. Here is the code that I have struggled with for weeks now:

import re
from math import ceil
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup

InitUrl = "https://mtgsingles.gr/search?q=dragon"
NumOfCrawledPages = 0
URL_Next = ""
NumOfPages = 4   # depth of pages to be retrieved

query = InitUrl.split("?")[1]


for i in range(0, NumOfPages):
    if i == 0:
        Url = InitUrl
    else:
        Url = URL_Next

    print(Url)

    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    page_soup = soup(page_html, "html.parser")

    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"

        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    try:
        URL_Next = InitUrl + "&page=" + str(i + 2)

        print("The next URL is: " + URL_Next + "\n")
    except IndexError:
        print("Crawling process completed! No more infomation to retrieve!")
    else:
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
Petris
  • Have a look at this regarding how try-except-else-finally works: https://stackoverflow.com/a/31626974/8240959 – jlaur Jun 19 '18 at 14:36
  • I did, but I did not notice what is wrong with my code concerning the try-except-else-finally statement. – Petris Jun 19 '18 at 19:51

2 Answers


One of the reasons your code fails is that you don't use cookies. The site seems to require them to allow paging.

A clean and simple way of extracting the data you're interested in would be like this:

import requests
from bs4 import BeautifulSoup

# the site actually uses this url under the hood for paging - check out Google Dev Tools
paging_url = "https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon"
return_list = []
# the page-scroll will only work when we support cookies
# so we fetch the page in a session
session = requests.Session()
session.get("https://mtgsingles.gr/")

All pages have a next button except the last one, so we use this knowledge to loop until the next button goes away. When it does, meaning that the last page has been reached, the button is replaced with an 'li' tag with the class 'next hidden', which only exists on the last page.

Now we're ready to start looping:

page = 1 # set count for start page
keep_paging = True # use flag to end loop when last page is reached
while keep_paging:
    print("[*] Extracting data for page {}".format(page))
    r = session.get(paging_url.format(page))
    soup = BeautifulSoup(r.text, "html.parser")
    items = soup.select('.iso-item.item-row-view.clearfix')
    for item in items:
        name = item.find('div', class_='col-md-10').get_text().strip().split('\xa0')[0]
        toughness_element = item.find('div', class_='card-power-toughness')
        try:
            toughness = toughness_element.get_text().strip()
        except AttributeError:  # raised when the card has no power/toughness element
            toughness = None
        cardtype = item.find('div', class_='cardtype').get_text()
        card_dict = {
            "name": name,
            "toughness": toughness,
            "cardtype": cardtype
        }
        return_list.append(card_dict)

    if soup.select('li.next.hidden'): # this element only exists if the last page is reached
        keep_paging = False
        print("[*] Scraper is done. Quitting...")
    else:
        page += 1

# do stuff with your list of dicts - e.g. load it into pandas and save it to a spreadsheet

This will keep paging until no more pages exist, no matter how many subpages the site has.
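The last comment in the snippet mentions loading the list of dicts into pandas and saving it to a spreadsheet. A minimal sketch of that final step, assuming pandas and an Excel writer such as openpyxl are installed (the file name is just an example), could look like this:

import pandas as pd

# build a DataFrame from the list of dicts collected above
df = pd.DataFrame(return_list)

# write it to a spreadsheet - or use df.to_csv("cards.csv") if you prefer CSV
df.to_excel("cards.xlsx", index=False)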

My point in the comment above was merely that if you encounter an exception in your code, your page count would never increase. That's probably not what you want, which is why I recommended that you learn a little more about the behaviour of the whole try-except-else-finally construct.
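As a generic sketch of that behaviour (the fetch function and the page numbers here are made up purely for illustration): the else branch only runs when the try block raises nothing, so a counter placed there is skipped for every attempt that fails, while finally runs either way.

def fetch(page):
    if page == 3:
        raise IndexError("no more pages")  # simulate one failing request
    return "page {}".format(page)

crawled = 0
for page in range(1, 5):
    try:
        data = fetch(page)
    except IndexError:
        print("Exception on page", page)
    else:
        crawled += 1  # only reached if the try block succeeded
    finally:
        print("Finished attempt for page", page)  # runs on success and failure

print("Pages crawled:", crawled)  # prints 3, not 4 - the else branch was skipped once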

jlaur
  • This worked really well, although there are a lot of things I did not fully understand. Is there any good resource for studying more about web scraping, because I am a bit overwhelmed... Thanks a lot! – Petris Jun 20 '18 at 13:18
  • O'Reilly did a good book on web scraping. Take a look at that: http://shop.oreilly.com/product/0636920034391.do – jlaur Jun 22 '18 at 21:36
  • I know it has been a long time, but I am revisiting this same code. I would like to ask how you found the real URL that the page uses (https://mtgsingles.gr/search?ajax=products-listing&lang=en&page={}&q=dragon). I used the Chrome dev tools and found this URL: https://mtgsingles.gr/search?q=dragon&card%5Bexpansion%5D=&card%5Brarity%5D=&card%5Blanguage%5D=&card%5Bcondition%5D=&card%5Bfoil%5D=&card%5Bcard_type%5D=&card%5Bcard_supertype%5D=&card%5Bcard_subtype%5D=&card%5Bsigned%5D=&card%5Baltered%5D=&card%5Bpower%5D=&card%5Btoughness%5D=&card%5Bconverted_manacost%5D=&yt0=search – Petris Jul 10 '18 at 16:02

I am also baffled by the request returning the same reply and ignoring the page parameter. As a dirty solution, I can offer you to first set the page-size to a high enough number to get all the items that you want (this parameter works for some reason...).

  import requests
  from bs4 import BeautifulSoup as soup

  InitUrl = "https://mtgsingles.gr/search"
  NumOfCrawledPages = 0
  URL_Next = ""
  NumOfPages = 2   # depth of pages to be retrieved

  query = "dragon"
  cardSet=set()

  for i in range(1, NumOfPages):
      # requesting a large "page-size" makes the site return (nearly) all results in one response
      page_html = requests.get(InitUrl, params={"page": i, "q": query, "page-size": 999})
      print(page_html.url)
      page_soup = soup(page_html.text, "html.parser")

      cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

      for card in cards:
          card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")

          if len(card.div.contents) > 3:
              cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
          else:
              cardP_T = "Does not exist"

          cardType = card.contents[3].text
          cardString=card_name + "\n" + cardP_T + "\n" + cardType + "\n"
          cardSet.add(cardString)
          print(cardString)
      NumOfCrawledPages += 1
      print("Moving to page : " + str(NumOfCrawledPages + 1) + " with " +str(len(cards)) +"(cards)\n")
Paul Würtz
  • Basically, in your script we do not retrieve the information page by page; instead, the "page-size" parameter includes all the search results on the same page. At least this is what I understand of it. – Petris Jun 19 '18 at 19:41
  • What if we want to make the crawler move from page to page? How can I construct the next URL on my own? – Petris Jun 19 '18 at 19:42
  • Moreover, I believe that in your code the for loop is not needed. – Petris Jun 19 '18 at 19:50
  • Yeah, it's true, the for loop is unnecessary. The page-by-page solution was right in theory, the way Petris implemented it. – Paul Würtz Jun 20 '18 at 02:38