I am playing around with a webpage containing MTG cards and I am trying to extract some information about them. The following program works fine and I am able to crawl through a page and retrieve all the desirable information:
import re
from math import ceil
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

def NumOfNextPages(TotalCardNum, CardsPerPage):
    pages = ceil(TotalCardNum / CardsPerPage)
    return pages
URL = "xyz.com"
NumOfCrawledPages = 0
UClient = uReq(URL) # downloading the url
page_html = UClient.read()
UClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# Finds all the cards that exist in the webpage and stores them as a bs4 object
cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
CardsPerPage = len(cards)
# Selects the card names, Power and Toughness, and the Set they belong to
for card in cards:
    card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
    if len(card.div.contents) > 3:
        cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
    else:
        cardP_T = "Does not exist"
    cardType = card.contents[3].text
    print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")
# Try to extract the URL of the next page. There is not always a next page
# to retrieve, so an IndexError is raised when we try to access index 0 of
# an empty list.
try:
    URL_Next = "xyz.com/" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
except IndexError:
    # End of crawling because of the IndexError: it means that there is no
    # next page to crawl
    print("Crawling process completed! No more information to retrieve!")
else:
    print("The next URL is: " + URL_Next + "\n")
    NumOfCrawledPages += 1
finally:
    print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
# We need to find the overall number of cards available in order to compute
# the number of pages that we need to crawl; we get that information from a
# "div" tag with class "summary"
OverallCardInfo = page_soup.find("div", {"class": "summary"}).text
TotalCardNum = int(re.findall(r"\d+", OverallCardInfo)[2])
NumOfPages = NumOfNextPages(TotalCardNum, CardsPerPage)
With this I can crawl the first page, which I give manually, and extract some info about the overall number of pages I need to crawl, as well as the next URL.
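For example, with made-up numbers (the real ones are parsed out of the "summary" div), the NumOfNextPages helper would report:

print(NumOfNextPages(600, 25))  # ceil(600 / 25) -> 24 pages to crawl
print(NumOfNextPages(610, 25))  # ceil(610 / 25) -> 25 pages, the last one partially filled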
Ultimately, I would like to give a starting point (webpage) and then the crawler would move to the other webpages on its own. So I used the following for loop:
for i in range(0, NumOfPages):
    # The number of items shown by the search option on xyz.com can not be
    # more than 10000
    if ((NumOfCrawledPages + 1) * CardsPerPage) >= 10000:
        print("Number of results provided can not exceed 10000!\nEnd of the crawling!")
        break

    if i == 0:
        Url = InitURL
    else:
        Url = URL_Next

    # opening up the connection and grabbing the page
    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # Finds all the cards that exist in the webpage and stores them as a bs4 object
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    # Selects the card names, Power and Toughness, and the Set they belong to
    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    # Try to extract the URL of the next page; an IndexError is raised when
    # we try to access index 0 of an empty list, i.e. when there is no next page
    try:
        URL_Next = "xyz.com" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
    except IndexError:
        # End of crawling because of the IndexError: it means that there is
        # no next page to crawl
        print("Crawling process completed! No more information to retrieve!")
    else:
        print("The next URL is: " + URL_Next + "\n")
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
The second code, with the additional for loop, runs without errors, but the result is not what I expected. It returns the crawling results of the first page that I enter manually and does not proceed further to other pages.
Why does this happen?
The expected output is something like:
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonstalker P/T: 3/3 Creature - Bird Soldier
The next URL is: xyz.com/......
Moving to page : 2
---------------------------------------------end of first page crawling
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonstalker P/T: 3/3 Creature - Bird Soldier
The next URL is: xyz.com/......
Moving to page : 3
After retrieving this information from the manually given webpage, it should go on to the next page saved in the Url variable in the for loop. Instead, it continues crawling the same page again and again. The counter works well, as it counts the number of pages crawled, but the Url variable does not seem to change its value.
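A minimal way to see the symptom, using the same variable names as in the loop above (this is just a debugging sketch, not new logic):

for i in range(0, NumOfPages):
    if i == 0:
        Url = InitURL
    else:
        Url = URL_Next
    # debug: I would expect a different URL on every iteration, but this
    # keeps printing the page that I entered manually
    print("Iteration " + str(i) + " fetches: " + Url)
    # ... rest of the loop body as above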