
I'm testing this code to try to download around 120 Excel files linked from one URL.

import requests
from bs4 import BeautifulSoup

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
resp = requests.get("https://healthcare.ascension.org/price-transparency/price-transparency-files", headers=headers)
soup = BeautifulSoup(resp.text, "html.parser")

for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        print(link['href'])
        url = "https://healthcare.ascension.org" + link['href']
        data = requests.get(url)  # <- this always comes back as Response [406]
        print(data)
        output = open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb')
        output.write(data.content)
        output.close()

This line: data=requests.get(url) always gives me a Response [406] result. Apparently, HTTP 406 is a status of "Not Acceptable" per HTTP.CAT and Mozilla. I'm not sure what's wrong here, but I think I should end up with 120 Excel files, with data, downloaded. Right now I am getting 120 Excel files on my laptop, but none of the files have any data in them.

ASH
  • Try passing headers as `data=requests.get(url, headers=headers)`. – Abhishek Jan 16 '22 at 15:32
  • Also inspect `data.text`. – Abdul Niyas P M Jan 16 '22 at 15:33
  • And take a look at the [`Referer`](https://developer.mozilla.org/ru/docs/Web/HTTP/Headers/Referer) and [`Cookie`](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cookie) headers; it would probably be better to use the same [`requests.Session`](https://docs.python-requests.org/en/latest/user/advanced/#session-objects) for all requests. You can check [my answer](https://stackoverflow.com/a/55919705/10824407) to a similar question; there's an explanation of how to sniff the actual browser request. – Olvin Roght Jan 16 '22 at 15:34
  • Seems like `data=requests.get(url, headers=headers)` did the trick. What does `headers=headers` do, exactly? – ASH Jan 16 '22 at 15:42
  • @ASH, what does passing a parameter to a method do? :D It forces `requests` to include the headers. Check the docs for a more detailed explanation. And I still recommend initializing a `Session`, preparing it, and using it for all requests (a minimal sketch follows below). – Olvin Roght Jan 16 '22 at 15:44
  • Some of the *a* tags have HREFs that are relative, some are absolute, and some are *javascript:void(0);*. You need to check for those conditions and process accordingly. Also, not all of the HREFs lead to xls-type files. You should check for that too. – DarkKnight Jan 16 '22 at 15:48
  • @Olvin, thanks, I will look into initializing a Session later today. Also, thanks JCeasar, I think I know what you mean; I need to dig in and look at this closer, probably this afternoon. – ASH Jan 16 '22 at 15:51
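
Putting those comment suggestions together, a minimal sketch of the `Session`-based approach might look like this (the user-agent string and download path are the OP's own; the HREF filtering is illustrative of DarkKnight's point and hasn't been tested against the live site):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"}
base = "https://healthcare.ascension.org"

# A Session re-sends the same headers (and any cookies) on every request
session = requests.Session()
session.headers.update(headers)

resp = session.get(f"{base}/price-transparency/price-transparency-files")
soup = BeautifulSoup(resp.text, "html.parser")

for link in soup.find_all("a", href=True):
    href = link["href"]
    # skip javascript:void(0); pseudo-links and anything that isn't an Excel file
    if href.startswith("javascript:") or not href.lower().endswith((".xls", ".xlsx")):
        continue
    # urljoin handles relative and absolute HREFs alike
    url = urljoin(base, href)
    data = session.get(url)
    with open(f'C:/Users/ryans/Downloads/{url.split("/")[-1]}', "wb") as output:
        output.write(data.content)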

2 Answers


This website seems to filter on the user-agent. Since you already set the header in your dictionary, you just need to pass it to requests when invoking the get method:

requests.get(url, headers=headers)

It seems that only the User-Agent header is checked.
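
Applied to the loop in the question, that is a one-argument change (the URL prefix, save path, and `headers` dict below are the OP's own):

for link in soup.find_all('a', href=True):
    if 'xls' in link['href']:
        url = "https://healthcare.ascension.org" + link['href']
        # pass the same headers on the file request too, or the server answers 406
        data = requests.get(url, headers=headers)
        with open(f'C:/Users/ryans/Downloads/{url.split("/")[-1].split(".")[0]}.xls', 'wb') as output:
            output.write(data.content)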

Sami Tahri

The HTTP 406 error arises when no User-Agent is specified.

Once that's been resolved and the HREFs appropriately parsed (for format and relevance), the OP's code should work. However, it's going to be very slow, because the XLSX files being acquired run to many megabytes each.

Therefore, matters can be improved greatly with a multithreaded approach as follows:

import requests
from bs4 import BeautifulSoup as BS
from concurrent.futures import ThreadPoolExecutor
import os
import re
import sys

HEADERS = {'User-Agent': 'PostmanRuntime/7.29.0'}  # any non-empty User-Agent avoids the 406
TARGET = '<Your download folder>'
HOST = 'https://healthcare.ascension.org'
XLSX = re.compile('.*xlsx$')  # only follow links whose HREF ends in xlsx

def download(url):
    try:
        base = os.path.basename(url)
        print(f'Processing {base}')
        # stream=True fetches the (large) file in chunks instead of all at once
        (r := requests.get(url, headers=HEADERS, stream=True)).raise_for_status()
        with open(os.path.join(TARGET, base), 'wb') as xl:
            for chunk in r.iter_content(chunk_size=16*1024):
                xl.write(chunk)
    except Exception as e:
        print(e, file=sys.stderr)


# fetch the index page, then hand each matching link to the thread pool
(r := requests.get(f'{HOST}/price-transparency/price-transparency-files', headers=HEADERS)).raise_for_status()
soup = BS(r.text, 'lxml')
with ThreadPoolExecutor() as executor:
    executor.map(download, [HOST + link['href'] for link in soup.find_all('a', href=XLSX)])
print('Done')

Note:

Python 3.8+ is required (the code uses the `:=` assignment expression, the "walrus" operator)

DarkKnight
  • It's not a stable multithreaded approach; any exception will stop execution. – Olvin Roght Jan 17 '22 at 08:12
  • @JCaesar, it's your job. A stable approach can be found in the [official docs](https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor-example) (see the sketch below). Also, it's not necessary to initialize a `Session` in a `with` clause; you can do it elsewhere in the code. – Olvin Roght Jan 17 '22 at 08:22
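
For reference, a sketch of that more defensive pattern from the `concurrent.futures` docs, reusing `download`, `soup`, `HOST`, `XLSX`, and `sys` from the answer above (the per-URL error reporting is an illustration, not part of the original answer):

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor() as executor:
    # map each future back to its HREF so failures can be reported usefully
    futures = {executor.submit(download, HOST + link['href']): link['href']
               for link in soup.find_all('a', href=XLSX)}
    for future in as_completed(futures):
        try:
            future.result()  # re-raises any exception raised inside download()
        except Exception as exc:
            print(f"{futures[future]} failed: {exc}", file=sys.stderr)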