0

I have an .xlsx file that I want to download with Python. Clicking the URL https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx downloads it automatically with no problems. However, the following code

import requests

url = "https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx"

with open("this_is_a_test.xlsx", "wb") as f:
    r = requests.get(url)
    f.write(r.content)
    print(r.ok)

outputs True and downloads the HTML page instead of the xlsx file. What is even more frustrating is that the same code worked perfectly fine before but for some reason changed its behaviour in the last 24h.


This thread and this thread discuss similar problems, however in both cases there is a login barrier which in my case is not present.


EDIT 1: After executing the code above and typing head this_is_a_test.xlsx in my terminal, this is the output I get:

<!DOCTYPE html>
<html lang="en" class="pb-page" data-request-id="fe043004-5c5a-4d2e-a323-cc9b39aa3339"><head data-pb-dropzone="head"><meta name="pbContext" content=";wgroup:string:Publication Websites;page:string:Cookie Absent;website:website:aaas-site" />
<script>AAASdataLayer={"page":{"pageInfo":{"pageTitle":"","pageURL":"https://www.science.org/action/cookieAbsent"},"attributes":{}},"user":{}};if(AAASdataLayer&&AAASdataLayer.user){let match=document.cookie&&document.cookie.match(/(?:^|; )consent=([^;]*)/);if(match){let jsonObj=JSON.parse(decodeURIComponent(match[1]));AAASdataLayer.user.cookieConsent=jsonObj.Marketing?'true':'false';}}</script> <link type="text/css" rel="stylesheet" href="/pb-assets/css/local-1639500397097.css">
<title>AAAS</title>
<meta charset="UTF-8">
<meta name="robots" content="noarchive,noindex,nofollow" />
<meta property="og:title" content="AAAS" />
<meta property="og:type" content="Website" />
<meta property="og:site_name" content="AAAS" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1, user-scalable=0" />

EDIT 2: Okay, so apparently the code downloads the Excel file when executed once, but changes the behaviour when executed a second time. Downloading it manually (by clicking on the link) still works. So, I guess there might still be a workaround for it?

chickenNinja123

2 Answers

1

A possible solution is to add a User-Agent header to the request:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

url = "https://www.science.org/doi/suppl/10.1126/science.aad0501/suppl_file/aad0501_table_s5.xlsx"

with open("this_is_a_test.xlsx", "wb") as f:
    r = requests.get(url, headers=headers)
    f.write(r.content)
    print(r.ok)
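
Since r.ok is True even when the server answers with an HTML page (as in EDIT 1 of the question), it may help to check the payload before writing it to disk. The following is a minimal sketch, assuming the file is a regular .xlsx: such files are ZIP archives, and a non-empty ZIP archive starts with the bytes PK\x03\x04. The helper names looks_like_xlsx and save_if_xlsx are hypothetical and not part of requests.

```python
# .xlsx files are ZIP archives, and a non-empty ZIP archive starts with these bytes.
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_xlsx(data: bytes) -> bool:
    """Return True if the payload starts with the ZIP signature."""
    return data.startswith(ZIP_MAGIC)


def save_if_xlsx(data: bytes, path: str) -> None:
    """Write data to path, but only if it looks like an .xlsx payload."""
    if not looks_like_xlsx(data):
        # Include the first bytes so an HTML error page is easy to spot.
        raise ValueError(f"not an xlsx payload, starts with {data[:15]!r}")
    with open(path, "wb") as f:
        f.write(data)


# Usage with the request from above (needs network access):
#   r = requests.get(url, headers=headers)
#   save_if_xlsx(r.content, "this_is_a_test.xlsx")
```

With this check, a response like the "Cookie Absent" page raises a ValueError instead of silently producing a broken this_is_a_test.xlsx.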
Gonzalo Odiard
-2

You need to write in binary mode instead. This person had a similar problem with urllib2. You can still use requests as long as you write the binary output to a file.

More Pythonic code example:

import requests
dls = "https://www.example.com/important.xls"
resp = requests.get(dls)
with open('test.xls', 'wb') as output:
    output.write(resp.content)



I tried this and had no issues reproducing the result.

  • Thanks for your answer. I am already using binary output and this example is doing pretty much the same thing as the example in the question. – chickenNinja123 Dec 19 '21 at 20:39