Why is the Python module requests downloading the HTML page instead of a file?

Question

I have an .xlsx file that I want to download with Python. If I click on the following URL https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx it automatically downloads it with no problems. However, the following code

import requests

url = "https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx"

with open("this_is_a_test.xlsx", "wb") as f:
    r = requests.get(url)
    f.write(r.content)
    print(r.ok)

outputs True and downloads the HTML page instead of the xlsx file. What is even more frustrating is that the same code worked perfectly fine before but for some reason changed its behaviour in the last 24h.

This thread and this thread discuss similar problems, however in both cases there is a login barrier which in my case is not present.

EDIT 1: After executing the code above and typing head this_is_a_test.xlsx in my terminal, this is the output I get:

<!DOCTYPE html>
<html lang="en" class="pb-page" data-request-id="fe043004-5c5a-4d2e-a323-cc9b39aa3339"><head data-pb-dropzone="head"><meta name="pbContext" content=";wgroup:string:Publication Websites;page:string:Cookie Absent;website:website:aaas-site" />
<script>AAASdataLayer={"page":{"pageInfo":{"pageTitle":"","pageURL":"https://www.science.org/action/cookieAbsent"},"attributes":{}},"user":{}};if(AAASdataLayer&&AAASdataLayer.user){let match=document.cookie&&document.cookie.match(/(?:^|; )consent=([^;]*)/);if(match){let jsonObj=JSON.parse(decodeURIComponent(match[1]));AAASdataLayer.user.cookieConsent=jsonObj.Marketing?'true':'false';}}</script> <link type="text/css" rel="stylesheet" href="/pb-assets/css/local-1639500397097.css">
<title>AAAS</title>
<meta charset="UTF-8">
<meta name="robots" content="noarchive,noindex,nofollow" />
<meta property="og:title" content="AAAS" />
<meta property="og:type" content="Website" />
<meta property="og:site_name" content="AAAS" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1, user-scalable=0" />

EDIT 2: Okay, so apparently the code downloads the Excel file when executed once, but changes the behaviour when executed a second time. Downloading it manually (by clicking on the link) still works. So, I guess there might still be a workaround for it?

Also works for me; it opens as an Excel file with all the right details, e.g. cell A1 reads "Table S5. Cell cycle gene-sets." — Chris Doyle, Dec 19 '21 at 20:18
Ok, running it a second time, I receive an HTML page. It's possible that the server implements some measure to avoid multiple downloads from the same IP — Gonzalo Odiard, Dec 19 '21 at 20:22
Thanks for checking @GonzaloOdiard ! Since manually clicking on the link still works, I was wondering if there is a workaround in Python... — chickenNinja123, Dec 19 '21 at 20:34
You can try setting the User-Agent, but no guarantees. The site could have more complex rules. https://stackoverflow.com/questions/27652543/how-to-use-python-requests-to-fake-a-browser-visit-a-k-a-and-generate-user-agent — Gonzalo Odiard, Dec 20 '21 at 12:22
@GonzaloOdiard Yes, it worked!! If you put it in an answer I will accept it. — chickenNinja123, Dec 20 '21 at 14:04

score 1 · Accepted Answer · answered Dec 20 '21 at 22:28

A possible solution is to add a User-Agent header, so that the request looks like it comes from a browser:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

url = "https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx"

with open("this_is_a_test.xlsx", "wb") as f:
    r = requests.get(url, headers=headers)
    f.write(r.content)
    print(r.ok)
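
Since r.ok was True even when the HTML page came back, it may be worth checking the payload before writing it to disk. Below is a minimal sketch of such a check, not part of the original answer: an .xlsx file is a ZIP archive, so its first bytes should be the ZIP signature. The names looks_like_xlsx and save_xlsx are hypothetical, and session is assumed to be a requests.Session (or anything with a compatible get method) that already carries the User-Agent header shown above.

```python
from pathlib import Path

# An .xlsx file is a ZIP archive, so it starts with the ZIP local-file signature.
ZIP_SIGNATURE = b"PK\x03\x04"


def looks_like_xlsx(payload: bytes) -> bool:
    """Return True if the payload starts with the ZIP signature."""
    return payload.startswith(ZIP_SIGNATURE)


def save_xlsx(session, url: str, path: str, timeout: float = 30.0) -> Path:
    """Fetch url via a requests.Session-like object; save it only if it is a ZIP/xlsx.

    Hypothetical helper: `session` is assumed to expose get(url, timeout=...)
    and to return an object with .content, .headers and .raise_for_status().
    """
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # surface 4xx/5xx instead of saving an error body

    if not looks_like_xlsx(response.content):
        # A 200 response can still be an HTML page, as in the question.
        content_type = response.headers.get("Content-Type", "unknown")
        raise ValueError(f"expected an .xlsx file, got Content-Type {content_type!r}")

    target = Path(path)
    target.write_bytes(response.content)
    return target
```

With requests, this could be called as save_xlsx(session, url, "this_is_a_test.xlsx"), where session = requests.Session() and session.headers.update(headers) has been applied first. A Session also keeps any cookies the site sets between requests; given that the HTML in the question is a "Cookie Absent" page, that might help too, but that is a guess about the server's behaviour.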

score -2 · Answer 2 · answered Dec 19 '21 at 20:31

You need to write the file in binary mode. This person had a similar problem with urllib2. You can still use requests, as long as you write the binary output (resp.content) to a file opened with 'wb'.

More Pythonic code example:

import requests
dls = "https://www.example.com/important.xls"
resp = requests.get(dls)
with open('test.xls', 'wb') as output:
    output.write(resp.content)

I tried this and did not have any issues recreating a result.

prostagmaProgram

Thanks for your answer. I am already using binary output and this example is doing pretty much the same thing as the example in the question. – chickenNinja123 Dec 19 '21 at 20:39
