
I am trying to extract the urls of the reviews on this webpage http://uk.ign.com/games/reviews then open the top 5 in separate tabs

Right now, I have attempted different selections to try to pick up the right data, but nothing seems to return anything. I can't even extract the URLs of each review in the list, let alone open the first 5 in separate tabs.

I am using Python 3 with the Python IDE

Here is my code:

import webbrowser, bs4, requests, re

webPage = requests.get("http://uk.ign.com/games/reviews",
                       headers={'User-Agent': 'Mozilla/5.0'})

webPage.raise_for_status()

webPage = bs4.BeautifulSoup(webPage.text, "html.parser")

#Me trying different selections to try to extract the right part of the page
webLinks = webPage.select(".item-title")
webLinks2 = webPage.select("h3")
webLinks3 = webPage.select("div item-title")

print(type(webLinks))
print(type(webLinks2))
print(type(webLinks3))
#I think this is where I've gone wrong. These all returning empty lists. 
#What am I doing wrong?


lenLinks = min(5, len(webLinks))
for i in range(lenLinks):
    webbrowser.open('http://uk.ign.com/' + webLinks[i].get('href'))
  • Any luck finding those links? – Nevermore May 14 '17 at 14:32
  • I can find ALL the links on the web page but I can't extract the links I want. webLinks = webPage.find_all('a') gives me all the links on the page Now I'm trying to extract the links under "item-title" with "h3" class. I've tried webItems = webPage.find_all('a', {'class' : "title"}) webby = webPage.find_all(class_="h3") None of these work, maybe I should use a for loop of some kind? – SeyiA May 15 '17 at 21:09

1 Answer


Using bs4 (BeautifulSoup) and the soup object it returns (which you have as webPage), you can call:

webLinks = webPage.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

find_all returns a list of elements matching the given tag name (in your case, a). These are the HTML elements; to get the links you need to go a step further. You can access an HTML element's attributes (in your case, you want the href) the same way you would a dict:

for a in webPage.find_all('a', href=True):
    print("Found the URL:", a['href'])

See BeautifulSoup getting href for more details. Or of course, the docs
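Putting it together for your case, here's a minimal sketch. It assumes the review links on the IGN page sit inside anchors with the class item-title (that class name is an assumption from your attempts, not verified against the live page, so inspect the page source to confirm), and it parses a small inline snippet instead of making a network call so it runs on its own:

```python
import bs4

# Stand-in for webPage.text -- the structure (anchors with class
# "item-title", some relative and some absolute hrefs) is an assumption
# about the live page, not a guarantee.
html = """
<div><a class="item-title" href="/articles/review-1">Review 1</a></div>
<div><a class="item-title" href="/articles/review-2">Review 2</a></div>
<div><a class="item-title" href="http://uk.ign.com/articles/review-3">Review 3</a></div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# select() takes a CSS selector: "a.item-title" means <a> tags whose
# class attribute contains "item-title". (Your "div item-title" selector
# failed because it looks for an <item-title> tag inside a <div>.)
links = [a['href'] for a in soup.select("a.item-title") if a.get('href')]

# Prefix relative URLs with the site root before opening them.
urls = [u if u.startswith("http") else "http://uk.ign.com" + u for u in links]

for url in urls[:5]:
    print(url)  # swap print(url) for webbrowser.open(url) on the live page
```

On the live page you would build soup from webPage.text as you already do, and call webbrowser.open on each of the first five URLs instead of printing them.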

P.S. Python is typically written with snake_case rather than camelCase :)

  • This works, and I was reading the find_all section of the Beautiful Soup doc and was wondering if I need to use find_parents() if I want to target specific links on a web page or should I use a for loop to pull out the links I want from the original find_all('a') statement, the same way you did with a['href']? – SeyiA May 15 '17 at 21:13
  • Hi! I'm glad it works -- I'm not sure about the next question you have, but I think you're on the right track: `find_parents/children` will return an object with which you can AGAIN call `find_all`... In any case, if this is the answer you're looking for, do mark it as accepted so others can find it later :) – Nevermore May 15 '17 at 21:38
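The chaining mentioned in the comments can be sketched as follows: every element a search returns is itself searchable, so you can narrow to a container first and then call find_all again on each result. The HTML snippet and class names here are illustrative, not taken from the IGN page:

```python
import bs4

# Illustrative markup: review titles as <h3> elements wrapping anchors.
html = """
<div class="listing">
  <h3 class="item-title"><a href="/r1">R1</a></h3>
  <h3 class="item-title"><a href="/r2">R2</a></h3>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# First search narrows to the <h3> elements; the second search runs on
# each of those elements, not on the whole document -- the same pattern
# find_parents()/find_all() chaining relies on.
hrefs = []
for h3 in soup.find_all('h3', class_='item-title'):
    for a in h3.find_all('a', href=True):
        hrefs.append(a['href'])

print(hrefs)
```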