33
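import os
import sys

import requests


def download_torrent(url):
    # Name the file after the 'title=' value in the URL and save it in the current directory.
    fname = os.path.join(os.getcwd(), url.split('title=')[-1] + '.torrent')
    try:
        # The URL passed in appears to have no scheme, so one is prepended.
        scheme = 'http:'
        r = requests.get(scheme + url, stream=True)
        with open(fname, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024):
                if chunk:  # skip empty keep-alive chunks
                    f.write(chunk)
    except requests.exceptions.RequestException as e:
        # OutColors is defined elsewhere in the full script.
        print('\n' + OutColors.LR + str(e))
        sys.exit(1)

    return fname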

This block of code raises an error when I run the full script. When it goes to actually download the torrent, I get:

('Connection aborted.', BadStatusLine("''",))

I only posted the block of code that I think is relevant above. The entire script is linked below. It's from pantuts, but I don't think it's maintained any longer, and I am trying to get it running with Python 3. From my research, the error might mean I'm using http instead of https, but I have tried both.
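
One way to try both schemes without editing the function each time is to pull the URL and filename handling into small helpers. This is only a sketch under assumptions: it assumes the script passes protocol-relative URLs (starting with `//`), which is inferred from the `schema + url` concatenation above; `with_scheme` and `torrent_filename` are hypothetical names; and `parse_qs` decodes the title, unlike the plain `split('title=')` in the original.

```python
import os
from urllib.parse import parse_qs, urlsplit, urlunsplit


def with_scheme(url, scheme="https"):
    """Return url with the given scheme, adding or replacing one as needed."""
    parts = urlsplit(url)
    if not parts.netloc:
        # Bare strings like 'example.com/x' are not handled by this sketch.
        raise ValueError("expected an absolute or protocol-relative URL: %r" % url)
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))


def torrent_filename(url, directory=None):
    """Build '<directory>/<title>.torrent' from the URL's 'title' query value."""
    query = parse_qs(urlsplit(url).query)
    # Fall back to a generic name if the URL has no 'title' parameter.
    title = query.get("title", ["download"])[0]
    return os.path.join(directory or os.getcwd(), title + ".torrent")
```

With these, `with_scheme(url, "http")` and `with_scheme(url, "https")` can each be passed to `requests.get` in turn to compare the results.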

Original script

ballade4op52
eurabilis
  • Could you provide a sample url where this happens? – TobiMarg Oct 16 '15 at 16:20
  • The code you pasted is missing a `try`. I'm getting a different error: `('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))` Hope a more descriptive error helps you. – midrare Oct 16 '15 at 16:29
  • Hmm. The script doesn't show me the URL when it runs, just the torrent name, so I can't post a sample URL. I just searched for "learning python" and selected the first torrent. I am not sure what you mean by a missing `try`. Can you elaborate? Thanks for your help. – eurabilis Oct 16 '15 at 16:35
  • The code snippet you pasted has an `except`, but not a `try`. It looks like the code in your github repo does though. I've highlighted the line I'm referring to here: https://github.com/pantuts/asskick/blob/master/asskick.py#L42 – midrare Oct 16 '15 at 16:45
  • Good catch, I missed that. I must have removed it when I took out the link to Stack Overflow to keep things neat. The actual code I am running has the `try:` in it, though, and unfortunately I still get the same BadStatusLine error. – eurabilis Oct 16 '15 at 16:50

3 Answers

52

The error you get indicates that the host isn't responding in the expected manner. In this case, it's because the host detects that you're trying to scrape it and is deliberately disconnecting you.

If you try your requests code with this URL from a test website: http://mirror.internode.on.net/pub/test/5meg.test1, you'll see that it downloads normally.

To get around this, fake your user agent. Your user agent identifies your web browser, and web hosts commonly check it to detect bots.

Use the headers argument to set your user agent. Here's an example that tells the web host you're Firefox.

headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0' }
r = requests.get(url, headers=headers)
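
Applied to the download function from the question, the change might look like the sketch below. It is not the original script's code: `download_file` is a hypothetical helper, the `session` argument is assumed to be a `requests.Session` (anything with a compatible `get()` works), and the timeout value is an arbitrary choice.

```python
# Browser-like User-Agent copied from the snippet above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) "
        "Gecko/20100101 Firefox/24.0"
    )
}


def download_file(session, url, dest_path, headers=None, timeout=30):
    """Stream url to dest_path, sending a browser-like User-Agent.

    session is expected to be a requests.Session; it is passed in so the
    same connection and cookies can be reused across downloads.
    """
    request_headers = dict(BROWSER_HEADERS)
    request_headers.update(headers or {})  # caller-supplied headers win

    response = session.get(
        url, headers=request_headers, stream=True, timeout=timeout
    )
    # Turn 4xx/5xx replies into exceptions instead of saving an error page.
    response.raise_for_status()

    with open(dest_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
    return dest_path
```

With `requests` installed, a call could look like `download_file(requests.Session(), 'http://mirror.internode.on.net/pub/test/5meg.test1', '5meg.test1')`.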

There are lots of other discrepancies[1] between bots and human-operated browsers that web hosts can check for, but the user agent is one of the easiest and most common ones.

If you want your scraper to be harder to detect, you'll want to use a headless browser like headless Chrome[2] (or ghost.py if you want to stick with Python), which you can trust will behave like a real browser (because it is!).


Footnotes:

[1] Other possible checks include whether images are being downloaded, whether page resources are requested in the normal order, whether pages are downloaded faster than a human can read them, and whether cookies are set properly. Google flags mouse movements deemed insufficiently human-like.

[2] Headless Chrome is the most competent headless browser as of 2018, but if its weight is a problem for you, its slightly outdated predecessors, PhantomJS and ghost.py, are lighter and still usable.
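
Two of the checks mentioned in the first footnote, cookies and request pacing, can be partly addressed without a headless browser. The sketch below is only an illustration under assumptions: `fetch_all` is a hypothetical helper, `session` is assumed to be a `requests.Session` (which keeps cookies between requests), and the delay range is arbitrary. It does nothing about images, resource order, or mouse movement.

```python
import random
import time


def fetch_all(session, urls, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Fetch urls one by one through a shared session, pausing between requests.

    Reusing one session means cookies set by an earlier response are sent
    with later requests, as a browser would do.
    """
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomized pause so requests are not fired back to back.
            sleep(random.uniform(min_delay, max_delay))
        responses.append(session.get(url, timeout=30))
    return responses
```

The `sleep` parameter exists only so the pause can be replaced in tests; in normal use it can be left at its default.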

midrare
2

try this:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0',
    'ACCEPT' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'ACCEPT-ENCODING' : 'gzip, deflate, br',
    'ACCEPT-LANGUAGE' : 'ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7',
    'REFERER' : 'https://www.google.com/'
}

r = requests.get("http://yourdomain.com/", headers=headers)
Mkurbanov
1

In my case, I had to remove the User-Agent field from the headers:

url='https://...'
headers = {}
requests.get(url, headers=headers)

Once I set 'User-Agent', I got ('Connection aborted.', BadStatusLine("''",)), and this error occurred only with that one site. This is my first post. I have gotten a lot of help from this site, and I hope this helps others who end up here.
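
Note that an empty `headers` dict does not drop the header altogether: `requests` falls back to its session defaults, which include a `python-requests/<version>` User-Agent, so the site in question may simply have been accepting that default. A quick way to check what would be sent without touching the network is sketched below (the URL is a placeholder, not the site from this answer):

```python
import requests

session = requests.Session()
# Build the request the way session.get() would, but do not send it.
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/", headers={})
)
# Shows the library default, of the form python-requests/<installed version>.
print(prepared.headers["User-Agent"])
```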

M.ison