
Problem:

I don't know if my Google-fu is failing me again, but I am unable to download CSVs from a list of URLs. I have used requests and bs4 to gather the URLs (the final list is correct) - see the outline below for more info.

I then followed one of the answers given here, using urllib to download: Trying to download data from URL with CSV File, as well as a number of other Stack Overflow Python answers for downloading CSVs.

Currently I am stuck with an

HTTP Error 404: Not Found

(the stack trace below is from the last attempt, where a User-Agent was passed)

----> 9 f = urllib.request.urlopen(req)
     10 print(f.read().decode('utf-8'))
     #other lines

--> 650         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    651 
    652 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found

I tried the solution here of adding a User-Agent: Web Scraping using Python giving HTTP Error 404: Not Found, though I would have expected a 403 rather than a 404 error code - but it seems to have worked for a number of OPs.

This still failed with the same error. I am pretty sure I could solve this by simply using Selenium and passing the CSV URLs to .get, but I want to know if I can solve this with requests alone.


Outline:

I visit this page:

https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice

I grab all the monthly version links, e.g. Patients Registered at a GP Practice May 2019, then visit each of those pages and grab all the CSV links within.

I loop over the final dictionary of filename:download_url pairs, attempting to download the files.


Question:

Can anyone see what I am doing wrong, or how to fix this so I can download the files without resorting to Selenium? I'm also unsure of the most efficient way to accomplish this - perhaps urllib is not actually required at all and requests alone will suffice?


Python:

Without user-agent:

import requests
from bs4 import BeautifulSoup as bs
import urllib.request

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  #ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  #flatten list of dicts to single dict


path = r'C:\Users\User\Desktop'

for k,v in all_files.items():
    #print(k,v)
    print(v)
    response = urllib.request.urlopen(v)
    html = response.read()

    with open(path + '\\' + k + '.csv', 'wb') as f:
        f.write(html)
    break  #as only need one test case

Test with a User-Agent added:

req = urllib.request.Request(
    v, 
    data=None, 
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
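
For comparison, the same test could be done with requests alone (a sketch; v is one of the gathered download URLs from the loop above):

# same test using requests only; v is one of the gathered download URLs
r = requests.get(v, headers={'User-Agent': 'Mozilla/5.0'})
r.raise_for_status()  # raises requests.exceptions.HTTPError on a 404, like urlopen
print(r.text)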
QHarr
  • Use something like Fiddler to check that the request you're making is (a) correct (are you mangling the filename somehow?) and (b) matches the request you'd make for the same file via a browser. I don't think the `User-Agent` header is relevant in this instance. I picked on https://files.digital.nhs.uk/0D/65E837/gp-reg-pat-prac-sing-age-male.csv ("Patients Registered at a GP Practice – May 2019: Single year of age (GP practice-males)") and was able to download it with every single header edited out other than `Host` via Fiddler's Replay > Reissue and Edit – Rob May 30 '19 at 10:54
  • @Rob Thanks. I also don't think User-Agent is relevant, particularly given the error code, but wanted to show my efforts. I will see if dev tools captures anything I can use. I checked by pasting the file URL into the browser and it downloaded fine. – QHarr May 30 '19 at 10:58
  • But I didn't check the concatenated URL! @Rob – QHarr May 30 '19 at 12:02

1 Answer


Looking at the values, your links are coming out as:

https://digital.nhs.uk/https://files.digital.nhs.uk/publicationimport/pub13xxx/pub13932/gp-reg-patients-04-2014-lsoa.csv

I think you want to drop the base + here - the CSV hrefs on these pages are already absolute URLs - so use this:

file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}

instead of:

file_links = {item.text.strip().split('\n')[0]:base + item['href'] for item in soup.select('[href$=".csv"]')}
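
A variant that handles both relative and absolute hrefs, if you'd rather not assume which kind the page uses, is urllib.parse.urljoin, which leaves already-absolute URLs untouched (a sketch, not part of the original answer):

from urllib.parse import urljoin

# urljoin resolves relative hrefs against base and passes absolute hrefs through unchanged
file_links = {item.text.strip().split('\n')[0]: urljoin(base, item['href'])
              for item in soup.select('[href$=".csv"]')}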

Edit: Full Code:

import requests
from bs4 import BeautifulSoup as bs

base = 'https://digital.nhs.uk/'
all_files = []

with requests.Session() as s:
    r = s.get('https://digital.nhs.uk/data-and-information/publications/statistical/patients-registered-at-a-gp-practice')
    soup = bs(r.content, 'lxml')
    links = [base + item['href'] for item in soup.select('.cta__button')]

    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        file_links = {item.text.strip().split('\n')[0]:item['href'] for item in soup.select('[href$=".csv"]')}
        if file_links:
            all_files.append(file_links)  #ignore empty dicts as for some months there is no data yet
        else:
            print('no data : ' + link)

all_files = {k: v for d in all_files for k, v in d.items()}  #flatten list of dicts to single dict

path = 'C:/Users/User/Desktop/'

for k,v in all_files.items():
    #print(k,v)
    print(v)
    response = requests.get(v)
    html = response.content

    k = k.replace(':', ' -')
    file = path + k + '.csv'

    with open(file, 'wb' ) as f:
        f.write(html)
    break  #as only need one test case
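
Note the replace near the end only handles `:`. Link text scraped from the page could in principle contain other characters Windows rejects in filenames; a more general sanitizer might look like this (a sketch, not part of the original answer):

import re

def safe_filename(name):
    # replace any character Windows disallows in filenames: \ / : * ? " < > |
    return re.sub(r'[\\/:*?"<>|]', ' -', name).strip()

file = path + safe_filename(k) + '.csv'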
chitown88
  • Thanks. I am getting an unknown filetype downloaded. Can you see why my .csv is not working? Happy to ask a follow-up if required. An example download path passed in the with at the end would be C:\Users\User\Desktop\Patients Registered at a GP Practice - May 2019: Totals (GP practice-all persons).csv – QHarr May 30 '19 at 12:10
  • Not sure what OS you're working with, but I'm on Windows and you can't have `:` in the filename. Check the solution above; I put the whole code that seems to be working for me. All I did was replace `:` with ` -` in the k variable. Be careful not to replace it in the path, as the `:` is required there – chitown88 May 30 '19 at 12:14
  • Works beautifully. Thank you. I am fairly new to Python - do you know of any resources that talk about which CSV download method is more efficient? urllib seemed common in the research I did on downloads. – QHarr May 30 '19 at 12:18
  • To be honest, not sure if there is a more efficient way. Personally, I usually go with `requests` as opposed to `urllib`, but I don't think there's a huge difference, as both are just going to grab the HTML content (see the streaming sketch after this thread) – chitown88 May 30 '19 at 12:20
  • @QHarr, you're new to Python? I see your solutions all the time on here related to Python, web-scraping, etc. – chitown88 May 30 '19 at 12:21
  • Only since the end of last year. I mostly rely on CSS/HTML/XHR knowledge and, yes, scraping in other languages. My Python is not very advanced, sadly. – QHarr May 30 '19 at 12:22
  • Ahh OK. Your solutions were always so nice, so I just assumed you'd been doing this for a while! Keep up the good work! – chitown88 May 30 '19 at 12:23
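
On the efficiency question raised in the comments above: for large CSVs, one common approach is to stream the response to disk in chunks rather than holding the whole body in memory via response.content (a sketch using requests; the chunk size and timeout values are arbitrary choices):

# stream the download in chunks instead of reading response.content in one go
response = requests.get(v, stream=True, timeout=30)
response.raise_for_status()
with open(file, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)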