I am playing around with a webpage containing MTG cards and I am trying to extract some information about them. The following program works fine and I am able to crawl through a page and retrieve all the desirable information:
import re
from math import ceil
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

def NumOfNextPages(TotalCardNum, CardsPerPage):
    pages = ceil(TotalCardNum / CardsPerPage)
    return pages
URL = "xyz.com"
NumOfCrawledPages = 0
UClient = uReq(URL) # downloading the url
page_html = UClient.read()
UClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# Finds all the cards that exist in the webpage and stores them as a bs4 object
cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
CardsPerPage = len(cards)
# Selects the card names, Power and Toughness, and the Set they belong to
for card in cards:
    card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
    if len(card.div.contents) > 3:
        cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
    else:
        cardP_T = "Does not exist"
    cardType = card.contents[3].text
    print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")
# Try to extract the URL of the next page. There is not always a next page
# to retrieve, so an IndexError is raised when we try to access index 0 of
# an empty list.
try:
    URL_Next = "xyz.com/" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
except IndexError:
    # End of crawling because of the IndexError: it means that there is no
    # next page to crawl
    print("Crawling process completed! No more information to retrieve!")
else:
    print("The next URL is: " + URL_Next + "\n")
    NumOfCrawledPages += 1
finally:
    print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
# We need to find the overall number of cards available in order to compute
# the number of pages that we need to crawl; we get that information from a
# "div" tag with class "summary"
OverallCardInfo = page_soup.find("div", {"class": "summary"}).text
TotalCardNum = int(re.findall(r"\d+", OverallCardInfo)[2])
NumOfPages = NumOfNextPages(TotalCardNum, CardsPerPage)
With this I can crawl the first page, which I give manually, and extract some info about the overall number of pages I need to crawl, as well as the next URL.
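For example, with made-up numbers (the real ones are parsed out of the "summary" div), the NumOfNextPages helper would report:

print(NumOfNextPages(600, 25))  # ceil(600 / 25) -> 24 pages to crawl
print(NumOfNextPages(610, 25))  # ceil(610 / 25) -> 25 pages, the last one partially filled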
Ultimately, I would like to give a starting point (webpage) and then the crawler would move to the other webpages on its own. So I used the following for loop:
for i in range(0, NumOfPages):
    # The number of items shown by the search option on xyz.com can not be
    # more than 10000
    if ((NumOfCrawledPages + 1) * CardsPerPage) >= 10000:
        print("Number of results provided can not exceed 10000!\nEnd of the crawling!")
        break

    if i == 0:
        Url = InitURL
    else:
        Url = URL_Next

    # opening up the connection and grabbing the page
    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()

    # html parsing
    page_soup = soup(page_html, "html.parser")

    # Finds all the cards that exist in the webpage and stores them as a bs4 object
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})

    # Selects the card names, Power and Toughness, and the Set they belong to
    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")

    # Try to extract the URL of the next page; an IndexError is raised when
    # we try to access index 0 of an empty list, i.e. when there is no next page
    try:
        URL_Next = "xyz.com" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
    except IndexError:
        # End of crawling because of the IndexError: it means that there is
        # no next page to crawl
        print("Crawling process completed! No more information to retrieve!")
    else:
        print("The next URL is: " + URL_Next + "\n")
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
The second code, with the additional for loop, runs without errors, but the result is not what I expected. It returns the crawling results of the first page that I enter manually and does not proceed further to other pages.
Why does this happen?
The expected output is something like:
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonstalker P/T: 3/3 Creature - Bird Soldier
The next URL is: xyz.com/......
Moving to page : 2
---------------------------------------------end of first page crawling
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonstalker P/T: 3/3 Creature - Bird Soldier
The next URL is: xyz.com/......
Moving to page : 3
After retrieving this information from the manually given webpage, it should go on to the next page saved in the Url variable in the for loop. Instead, it continues crawling the same page again and again. The counter works well, as it counts the number of pages crawled, but the Url variable does not seem to change its value.
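A minimal way to see the symptom, using the same variable names as in the loop above (this is just a debugging sketch, not new logic):

for i in range(0, NumOfPages):
    if i == 0:
        Url = InitURL
    else:
        Url = URL_Next
    # debug: I would expect a different URL on every iteration, but this
    # keeps printing the page that I entered manually
    print("Iteration " + str(i) + " fetches: " + Url)
    # ... rest of the loop body as above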