
I am trying to extract the urls of the reviews on this webpage http://uk.ign.com/games/reviews then open the top 5 in separate tabs

Right now, I have attempted different selections to try to pick up the right data, but nothing seems to return anything. I can't even extract the URLs of each review in the list, let alone open the first 5 in separate tabs.

I am using Python 3 with the Python IDE

Here is my code:

import webbrowser, bs4, requests, re

webPage = requests.get("http://uk.ign.com/games/reviews",
                       headers={'User-Agent': 'Mozilla/5.0'})

webPage.raise_for_status()

webPage = bs4.BeautifulSoup(webPage.text, "html.parser")

#Me trying different selections to try to extract the right part of the page
webLinks = webPage.select(".item-title")
webLinks2 = webPage.select("h3")
webLinks3 = webPage.select("div item-title")

print(type(webLinks))
print(type(webLinks2))
print(type(webLinks3))
#I think this is where I've gone wrong. These all returning empty lists. 
#What am I doing wrong?


lenLinks = min(5, len(webLinks))
for i in range(lenLinks):
    webbrowser.open('http://uk.ign.com/' + webLinks[i].get('href'))
  • Any luck finding those links? – Nevermore May 14 '17 at 14:32
  • I can find ALL the links on the web page but I can't extract the links I want. webLinks = webPage.find_all('a') gives me all the links on the page Now I'm trying to extract the links under "item-title" with "h3" class. I've tried webItems = webPage.find_all('a', {'class' : "title"}) webby = webPage.find_all(class_="h3") None of these work, maybe I should use a for loop of some kind? – SeyiA May 15 '17 at 21:09

1 Answer


Using bs4 (BeautifulSoup) and the soup object it returns (which you have as webPage), you can call:

webLinks = webPage.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

find_all returns a list of elements matching the given tag name (in your case, a). These are the HTML elements; to get the links you need to go a step further. You can access an HTML element's attributes (in your case, you want the href) the same way you would a dict:

for a in webPage.find_all('a', href=True):
    print("Found the URL:", a['href'])

See BeautifulSoup getting href for more details. Or of course, the docs
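Putting it together for your case, here's a minimal sketch. It assumes the review links on the IGN page sit inside anchors with the class item-title (that class name is an assumption from your attempts, not verified against the live page, so inspect the page source to confirm), and it parses a small inline snippet instead of making a network call so it runs on its own:

```python
import bs4

# Stand-in for webPage.text -- the structure (anchors with class
# "item-title", some relative and some absolute hrefs) is an assumption
# about the live page, not a guarantee.
html = """
<div><a class="item-title" href="/articles/review-1">Review 1</a></div>
<div><a class="item-title" href="/articles/review-2">Review 2</a></div>
<div><a class="item-title" href="http://uk.ign.com/articles/review-3">Review 3</a></div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# select() takes a CSS selector: "a.item-title" means <a> tags whose
# class attribute contains "item-title". (Your "div item-title" selector
# failed because it looks for an <item-title> tag inside a <div>.)
links = [a['href'] for a in soup.select("a.item-title") if a.get('href')]

# Prefix relative URLs with the site root before opening them.
urls = [u if u.startswith("http") else "http://uk.ign.com" + u for u in links]

for url in urls[:5]:
    print(url)  # swap print(url) for webbrowser.open(url) on the live page
```

On the live page you would build soup from webPage.text as you already do, and call webbrowser.open on each of the first five URLs instead of printing them.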

P.S. Python is typically written with snake_case rather than camelCase :)

  • This works, and I was reading the find_all section of the Beautiful Soup doc and was wondering if I need to use find_parents() if I want to target specific links on a web page or should I use a for loop to pull out the links I want from the original find_all('a') statement, the same way you did with a['href']? – SeyiA May 15 '17 at 21:13
  • Hi! I'm glad it works -- I'm not sure about the next question you have, but I think you're on the right track: `find_parents/children` will return an object with which you can AGAIN call `find_all`... In any case, if this is the answer you're looking for, do mark it as accepted so others can find it later :) – Nevermore May 15 '17 at 21:38
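The chaining mentioned in the comments can be sketched as follows: every element a search returns is itself searchable, so you can narrow to a container first and then call find_all again on each result. The HTML snippet and class names here are illustrative, not taken from the IGN page:

```python
import bs4

# Illustrative markup: review titles as <h3> elements wrapping anchors.
html = """
<div class="listing">
  <h3 class="item-title"><a href="/r1">R1</a></h3>
  <h3 class="item-title"><a href="/r2">R2</a></h3>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# First search narrows to the <h3> elements; the second search runs on
# each of those elements, not on the whole document -- the same pattern
# find_parents()/find_all() chaining relies on.
hrefs = []
for h3 in soup.find_all('h3', class_='item-title'):
    for a in h3.find_all('a', href=True):
        hrefs.append(a['href'])

print(hrefs)
```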