Torrent page parsing fails

Question

I am trying to parse the movie page from rarbg.to with BeautifulSoup in order to collect the titles of the movies.

My Python code is the following:

import urllib2
from bs4 import BeautifulSoup
url = "https://rarbg.to/torrents.php?category=movies"

hdr = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}
req = urllib2.Request(url, headers=hdr)

try:
    page = urllib2.urlopen(req)
except urllib2.HTTPError as e:
    # Show the error body, then stop: without a response, `page` is undefined below
    print e.fp.read()
    raise

# Get all the HTML page
raw_content = page.read()
# print raw_content #debug

# Pass the html page to BeautifulSoup
soup = BeautifulSoup(raw_content)
print soup #debug

movie_titles = soup.find_all("tr","lista2")
print movie_titles

The first time I ran it, it correctly printed a list of movie elements (the table rows).

But when I ran it several more times after that, it returned this:

<html><head>
</head>
<body>
<style type="text/css">a,abbr,acronym,address,applet,article,aside,audio,b,big,blockquote,body,canvas,caption,center,cite,code,dd,del,details,dfn,div,dl,dt,em,fieldset,figcaption,figure,footer,form,h1,h2,h3,h4,h5,h6,header,hgroup,html,i,iframe,img,ins,kbd,label,legend,li,mark,menu,nav,object,ol,p,pre,q,s,samp,section,small,span,strike,strong,sub,summary,sup,table,tbody,td,tfoot,th,thead,time,tr,tt,u,ul,var,video{margin:0;padding:0;border:0;outline:0;font:inherit;vertical-align:baseline}article,aside,details,figcaption,figure,footer,header,hgroup,menu,nav,section{display:block}body{line-height:1}ol,ul{list-style:none}blockquote,q{quotes:none}blockquote:after,blockquote:before,q:after,q:before{content:'';content:none}ins{text-decoration:none}del{text-decoration:line-through}table{border-collapse:collapse;border-spacing:0}
body {
    background: #000 url("//dyncdn.me/static/20/img/bknd_body.jpg") repeat-x scroll 0 0 !important;
    font: 400 8pt normal Tahoma,Verdana,Arial,Arial  !important;
}
.button {
    background-color: #3860bb;
    border: none;
    color: white;
    padding: 15px 32px;
    text-align: center;
    text-decoration: none;
    display: inline-block;
    font-size: 16px;
    cursor: pointer;
    text-transform: none;
    overflow: visible;
}
.content-rounded {
    background: #fff none repeat scroll 0 0 !important;
    border-radius: 3px;
    color: #000 !important;
    padding: 20px;
    width:961px;
}
</style><div align="center" style="margin-top:20px;padding-top:20px;color: #000 !important;">
<div class="content-rounded" style="color: #000 !important;">
<img src="//dyncdn.me/static/20/img/logo_dark_nodomain2_optimized.png"/><br/>Please wait while we try to verify your browser...<br/>If you are stuck on this page disable your browser addons<br/><img src="//dyncdn.me/static/20/img/loading_flat.gif"/>
</div>
</div>
<script>
var w = window.innerWidth || document.documentElement.clientWidth || document.body.clientWidth;
var h = window.innerHeight || document.documentElement.clientHeight || document.body.clientHeight;
var days = 7;
var date = new Date();
var name = 'sk';
var value_sk = 'iqcdg1oe63';
date.setTime(date.getTime()+(days*24*60*60*1000));
var expires = ";expires="+date.toGMTString();
document.cookie = name+"="+value_sk+expires+"; path=/";

if(w < 100 || h < 100) {
    window.location.href = "/threat_defence.php?defence=nojc&r=54677187";
} else {
    if(!document.domain) { var ref_cookie = ''; } else { var ref_cookie = document.domain; }
    setTimeout(function(){
        window.location.href = "/threat_defence.php?defence=2&sk="+value_sk+"&ref_cookie="+ref_cookie+"&r=74070547";
    }, 3000);
}
</script>
</body></html>
[]

Process finished with exit code 0

As far as I can tell, the message "Please wait while we try to verify your browser... If you are stuck on this page disable your browser addons" has to do with the problem.

Is it some kind of precaution against DDoS attacks, or a CAPTCHA? I am only making one or two requests per minute or so during development.

Read the TOS... "You are not permitted to, and you warrant and agree that you will not do or facilitate any of the following... (7) use any robot, spider, web crawler, other automatic device, or manual process to copy our web pages, torrents, or other content contained without our prior expressed written permission." https://rarbg.to/useragreement.php — OneCricketeer, Mar 20 '17 at 16:26

innicoder · Accepted Answer · 2017-03-20T20:04:30.057

It's not DDoS protection; if it were, you'd simply be blocked or filtered. The problem here is that they use another kind of check (like a CAPTCHA) to determine whether your browser is driven by a human. As you can see in your output, the page redirects you to another page via JavaScript; a real browser follows that redirect automatically, whereas your script does not.
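To make the script notice this situation instead of silently printing an empty list, it could check for the verification page before parsing. The following is only a minimal sketch for Python 3: the marker text is copied from the output in the question, while the names VERIFY_MARKER and is_verification_page are made up for illustration, and the site may change its wording at any time.

```python
# Text copied from the verification page shown in the question.
VERIFY_MARKER = "Please wait while we try to verify your browser"


def is_verification_page(html):
    """Return True if html looks like the browser-verification page."""
    if isinstance(html, bytes):
        # In Python 3, response.read() returns bytes, so decode leniently first.
        html = html.decode("utf-8", errors="replace")
    return VERIFY_MARKER in html
```

Calling is_verification_page(raw_content) right after reading the response would let the script stop with a clear message rather than handing the page to BeautifulSoup.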

Now you're probably looking for possible solutions to this problem. Here are a few:

Implement a wait time before each request (import time, then call time.sleep(seconds)).
Use Selenium: 'Selenium automates browsers. That's it! What you do with that power is entirely up to you.' This is my recommendation.
Use a proxy or another identity-scrambling solution.
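The first option can be sketched roughly as follows, for Python 3. The function name fetch_all, the default delay of 5 seconds, and the injectable fetch and sleep parameters are assumptions made for illustration; nothing here says what delay the site would tolerate.

```python
import time
import urllib.request


def fetch_all(urls, delay_seconds=5.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing delay_seconds between requests."""
    if fetch is None:
        # Default fetcher: a plain GET using the standard library.
        def fetch(url):
            with urllib.request.urlopen(url) as response:
                return response.read()

    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        pages.append(fetch(url))
    return pages
```

Passing fetch and sleep as parameters is only a convenience so the pacing logic can be tried without touching the network.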

Selenium is, in my own words, "a fake browser" (me, 2017). It supports explicit waits, such as waiting until EC.presence_of_element_located((By.ID, "myDynamicElement")) holds (see http://selenium-python.readthedocs.io/waits.html), so you can program it to mimic human behavior.
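An explicit wait of that kind might look like the sketch below (Python 3). It assumes Selenium and Firefox are installed, and that the rows of interest match the CSS selector tr.lista2, which is taken from the find_all call in the question. The function name fetch_rendered_html and the 15-second timeout are illustrative choices only.

```python
def fetch_rendered_html(url, row_selector="tr.lista2", timeout=15):
    """Load url in a real browser and return the HTML once a matching row exists."""
    # Imported here so this sketch can be loaded even where Selenium is not installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox is available on this machine
    try:
        driver.get(url)
        # Block until at least one matching element is present;
        # raises TimeoutException if none appears within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, row_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on timeout
```

The returned string can then be passed to BeautifulSoup exactly as raw_content is in the question.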

Thanks. I have used Selenium heavily, but I didn't want to use it here. I am putting together a POC pet project, nothing major or harmful. I will look into the proxy solution, even though I think it is overkill, because I will only run the script once or twice every day or two. — Kostas Demiris, Mar 20 '17 at 23:14
Yeah I know that feeling, I hate using it too, not because it's bad but it doesn't feel like a script. You're not making a bot, you're making a human. — innicoder, Mar 20 '17 at 23:21

score 1 · Answer 2 · edited May 23 '17 at 11:54

I had to make a lot of requests to the website to reproduce this. It looks like my IP is now blocked.

Consider using something like Tor or a VPN to change your IP after a couple of tries.

answered Mar 20 '17 at 16:18

Fernando Cezar
