Building a Python web scraper, need help to get correct output

Question

I am building a web scraper in Python. Its purpose is to fetch all the links to websites listed on this page: http://www.ebizmba.com/articles/torrent-websites

I want output like this:

www.thepiratebay.se
www.kat.ph

I am new to Python and scraping, and I am doing this just for practice. Please help me get the right output.

My code:

import requests

from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")

soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all("div", {"class:", "main-container-2"})
for item in data:
    print(item.contents[1].find_all("a"))

My output: https://i.stack.imgur.com/Xi37B.png

What output do you get now? – Emil Vikström Dec 03 '15 at 09:23

score 0 · Answer 1 · answered Dec 03 '15 at 09:36

Use .get('href') like this:

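import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")

soup = BeautifulSoup(r.text, "html.parser")
# Pass attrs as a dict; the question's {"class:", "main-container-2"} is a set literal.
data = soup.find_all("div", {"class": "main-container-2"})

for i in data:
    # contents[1] is the second child node of the div; collect its links.
    for j in i.contents[1].find_all("a"):
        print(j.get('href'))  # the URL itself rather than the whole <a> tag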

Full output:

http://www.thepiratebay.se
http://siteanalytics.compete.com/thepiratebay.se
http://quantcast.com/thepiratebay.se
http://www.alexa.com/siteinfo/thepiratebay.se/
http://www.kickass.to
http://siteanalytics.compete.com/kickass.to
http://quantcast.com/kickass.to
http://www.alexa.com/siteinfo/kickass.to/
http://www.torrentz.eu
http://siteanalytics.compete.com/torrentz.eu
http://quantcast.com/torrentz.eu
http://www.alexa.com/siteinfo/torrentz.eu/
http://www.extratorrent.cc
http://siteanalytics.compete.com/extratorrent.cc
http://quantcast.com/extratorrent.cc
http://www.alexa.com/siteinfo/extratorrent.cc/
http://www.yify-torrents.com
http://siteanalytics.compete.com/yify-torrents.com
http://quantcast.com/yify-torrents.com
http://www.alexa.com/siteinfo/yify-torrents.com
http://www.bitsnoop.com
http://siteanalytics.compete.com/bitsnoop.com
http://quantcast.com/bitsnoop.com
http://www.alexa.com/siteinfo/bitsnoop.com/
http://www.isohunt.to
http://siteanalytics.compete.com/isohunt.to
http://quantcast.com/isohunt.to
http://www.alexa.com/siteinfo/isohunt.to/
http://www.sumotorrent.sx
http://siteanalytics.compete.com/sumotorrent.sx
http://quantcast.com/sumotorrent.sx
http://www.alexa.com/siteinfo/sumotorrent.sx/
http://www.torrentdownloads.me
http://siteanalytics.compete.com/torrentdownloads.me
http://quantcast.com/torrentdownloads.me
http://www.alexa.com/siteinfo/torrentdownloads.me/
http://www.eztv.it
http://siteanalytics.compete.com/eztv.it
http://quantcast.com/eztv.it
http://www.alexa.com/siteinfo/eztv.it/
http://www.rarbg.com
http://siteanalytics.compete.com/rarbg.com
http://quantcast.com/rarbg.com
http://www.alexa.com/siteinfo/rarbg.com/
http://www.1337x.org
http://siteanalytics.compete.com/1337x.org
http://quantcast.com/1337x.org
http://www.alexa.com/siteinfo/1337x.org/
http://www.torrenthound.com
http://siteanalytics.compete.com/torrenthound.com
http://quantcast.com/torrenthound.com
http://www.alexa.com/siteinfo/torrenthound.com/
https://demonoid.org/
http://siteanalytics.compete.com/demonoid.pw
http://quantcast.com/demonoid.pw
http://www.alexa.com/siteinfo/demonoid.pw/
http://www.fenopy.se
http://siteanalytics.compete.com/fenopy.se
http://quantcast.com/fenopy.se
http://www.alexa.com/siteinfo/fenopy.se/

Hey, thanks @kevin. Can you tell me how I can get more refined output, like this: http://www.thepiratebay.se http://www.kickass.to http://www.torrentz.eu .... – Elliot Anderson, Dec 03 '15 at 09:52
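One possible way to narrow the list down, sketched under the assumption that the unwanted links always point at the three statistics hosts visible in the full output above (compete, quantcast, alexa). The names `STATS_HOSTS`, `keep_site_links` and `sample` are made up for this illustration, and `sample` is a short excerpt of the output standing in for the scraped hrefs.

```python
from urllib.parse import urlparse

# Hosts of the statistics services that appear in the full output above.
STATS_HOSTS = {"siteanalytics.compete.com", "quantcast.com", "www.alexa.com"}


def keep_site_links(hrefs):
    """Return only the hrefs whose host is not one of the statistics services."""
    return [h for h in hrefs if urlparse(h).netloc not in STATS_HOSTS]


# A few entries copied from the full output, standing in for the scraped hrefs.
sample = [
    "http://www.thepiratebay.se",
    "http://siteanalytics.compete.com/thepiratebay.se",
    "http://quantcast.com/thepiratebay.se",
    "http://www.alexa.com/siteinfo/thepiratebay.se/",
    "http://www.kickass.to",
    "http://siteanalytics.compete.com/kickass.to",
    "https://demonoid.org/",
]

for link in keep_site_links(sample):
    print(link)
```

In the scraper, the list passed to `keep_site_links` would be the `j.get('href')` values collected in the loop instead of `sample`. If the page links to other statistics hosts, they would need to be added to the set.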

576i · Accepted Answer · 2015-12-03T09:51:52.817

0

If you are web scraping for practice, have a look at regular expressions. The code below gets just the headline links. The Needle1 string is the pattern to match, and the parentheses (http:.*?) form the capture group that holds the URL.

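import re
import urllib.request

myURL = "http://www.ebizmba.com/articles/torrent-websites"

# Python 3 port of the urllib2 version: urllib2 became urllib.request,
# and read() returns bytes, so decode before matching a str pattern.
html = urllib.request.urlopen(myURL).read().decode("utf-8", errors="replace")

# The parentheses capture the URL of each headline link.
Needle1 = '<p><a href="(http:.*?)" rel="nofollow" target="_blank">'
for match in re.finditer(Needle1, html):
    print(match.group(1))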
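To see what the needle captures without fetching the page, here is a small offline sketch. The HTML fragment is invented for illustration (it is not copied from the real page) and assumes the headline links are written exactly in the form the needle expects.

```python
import re

NEEDLE = '<p><a href="(http:.*?)" rel="nofollow" target="_blank">'

# Made-up fragment, not copied from the real page.
html = (
    '<p><a href="http://www.example.com" rel="nofollow" target="_blank">example.com</a></p>\n'
    '<a href="http://stats.example.org/example.com" target="_blank">stats</a>\n'
    '<p><a href="http://www.example.net" rel="nofollow" target="_blank">example.net</a></p>\n'
)

# group(1) is the text matched by the parentheses, i.e. the URL.
links = [m.group(1) for m in re.finditer(NEEDLE, html)]
print(links)
```

Only the two links that sit directly after a `<p>` and carry `rel="nofollow"` are captured. The middle link is skipped because it does not match the surrounding literal text.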

edited Dec 03 '15 at 09:51

answered Dec 03 '15 at 09:44

576i


[**Using RegEx to parse HTML is a bad idea.**](http://stackoverflow.com/a/1732454/5299236) – Remi Guan Dec 03 '15 at 09:56
@Kevin, agreed that in many cases parsing HTML with RegEx can be a bad idea, because you have to validate your results to make sure that you don't get nasty surprises. On the other hand you get quick results - so it all depends on the use (and re-use) case. – 576i Dec 03 '15 at 10:38
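A rough illustration of the kind of surprise meant here, assuming nothing about the real page: the same hypothetical link is written twice, once as the needle expects and once with its attributes reordered. The standard library's `html.parser` stands in for BeautifulSoup so the sketch has no third-party dependency, and `HrefCollector`, `hrefs_by_regex` and `hrefs_by_parser` are names invented for this example.

```python
import re
from html.parser import HTMLParser

NEEDLE = '<p><a href="(http:.*?)" rel="nofollow" target="_blank">'

# Same hypothetical link written twice, with the attributes in a different order.
original = '<p><a href="http://www.example.com" rel="nofollow" target="_blank">x</a></p>'
reordered = '<p><a target="_blank" rel="nofollow" href="http://www.example.com">x</a></p>'


class HrefCollector(HTMLParser):
    """Collect the href of every <a> tag, whatever the attribute order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def hrefs_by_regex(html):
    """Return the URLs captured by the needle."""
    return re.findall(NEEDLE, html)


def hrefs_by_parser(html):
    """Return the href of every <a> tag found by the HTML parser."""
    collector = HrefCollector()
    collector.feed(html)
    return collector.hrefs


for name, html in (("original", original), ("reordered", reordered)):
    print(name, hrefs_by_regex(html), hrefs_by_parser(html))
```

The regex finds the link only in the first form, while the parser finds it in both. That is the trade-off discussed above: the regex is quick to write but tied to the exact markup.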
