
I am getting a 403 error when trying to download a file from CloudFront using Python. My code is shown below:

import requests
import shutil


headers = {
    'authority': 'd2mgevdyeotxc9.cloudfront.net',
    'origin': 'https://www.djcity.com',
    'upgrade-insecure-requests': '0',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'sec-fetch-site': 'cross-site',
    'referer': 'https://www.djcity.com/digital/team-salut-wagon----67677.htm',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-AU,en-GB;q=0.9,en-US;q=0.8,en;q=0.7',
}

url = 'https://d2mgevdyeotxc9.cloudfront.net/6/7/6/7/7/zohujeaftuoduhba.mp3?response-content-disposition=attachment%3bfilename%3d%22Team%2520Salut%2520-%2520Wagon%2520%28Intro%29.mp3%22%3bfilename%2a%3dUTF-8%27%27Team%2520Salut%2520-%2520Wagon%2520%28Intro%29.mp3&Expires=1570451804&Signature=RPS7J02WY~eHDGYQHHSel9v~bpRolXl~WfgZjO21BrFhLuQQGkBSKkViD~-~K9DGPxmK0yj6-3zVrjrj1nLDMTelKQcbGk8tqyTBgbyI8NGMI9neR~wdtydKedxFp0a3YRHoM03boPFqwKKnKl5PcVxle1tCoXODydUq6Uj-iFxfYfpj10V0oJCa1Kv6SqWrrh1AGnW9CZOSPTlokjWzH6QS7DzBPN-0JacZDkLtG7wnnpcAT9Woj6h3YkdqMlfPpugxauUMXPxFg9sao-IG0BI4SyLeKCzZbJAqwAJruDoVZ1lKOwzgruDkCNPYERfpUytiOMdnbboL9lro4rvymg__&Key-Pair-Id=APKAJLJPFMSLJDIGQDHA'
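# Pass the headers explicitly; without headers=headers the dict above is never sent.
with requests.get(url, headers=headers, stream=True) as r:
    with open('track1.mp3', 'wb') as f:
        shutil.copyfileobj(r.raw, f)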

I get the following response headers:

'Content-Type': 'text/html', 
'Content-Length': '228',
'Connection': 'keep-alive', 
'Date': 'Thu, 03 Oct 2019 19:00:44 GMT',
'Last-Modified': 'Mon, 25 Sep 2017 19:02:17 GMT', 
'ETag': '53954bb03f2f3597aa5025deb69ca9b4', 
'Accept-Ranges': 'bytes',
'Server': 'AmazonS3', 
'X-Cache': 'Error from cloudfront', 
'Via': '1.1 934dd0fb722aa582f1b4a3cdae35b12d.cloudfront.net (CloudFront)', 
'X-Amz-Cf-Pop': 'SIN2-C1',
'X-Amz-Cf-Id': '4irBaWV9o-9_kQN0bNHFJydrIiuZxtVLZ36Oc5vDXX1AE76iTAbDww=='

As an example, a successful response header looks something like this:

'accept-ranges': 'bytes',
'content-disposition': 'attachment;filename="Carisma%20-%20Sample.mp3";filename*=UTF-8''Carisma%20-%20Sample.mp3',
'content-length': '7805066',
'content-type': 'audio/mpeg',
'date': 'Tue, 08 Oct 2019 00:01:06 GMT',
'etag': '83244899c9910bcf0f10f5065293b709',
'last-modified': 'Thu, 03 Oct 2019 01:03:52 GMT',
'server': 'AmazonS3',
'status': '200',
'via': '1.1 a84eb604396158af577c875ac569048a.cloudfront.net (CloudFront)',
'x-amz-cf-id': 'one65QPSUx5IB0_JMtinKzLles7vSchJXSz7ddx9auSPmbtqJ0Doug==',
'x-amz-cf-pop': 'SIN2-C1',
'x-cache': 'Miss from cloudfront'

I'm not sure why it's not working, as inspecting the network traffic in the browser suggests I've sent all the required headers. How can I get this sort of request working?

The response body looks like a redirect: b'<html><head><meta http-equiv="refresh" content="0;URL=http://www.djcity.com/digital/record-pool.aspx?m=3"><script>window.location.replace("http://www.djcity.com/digital/record-pool.aspx?m=3");</script></head><body></body></html>'
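To make this kind of failure easier to spot before anything is written to disk, here is a minimal sketch of a check on the status code, content type, and body. The helper name `describe_failure` is hypothetical, and the regular expression only assumes a meta-refresh tag shaped like the one in the body above; it is not a general HTML parser.

```python
import re

# Hypothetical helper: returns None when the response looks like the file,
# otherwise a short description of what came back instead.
def describe_failure(status_code, headers, body):
    # With requests, r.headers is case-insensitive; a plain dict is not.
    content_type = headers.get("Content-Type", "")
    if status_code == 200 and not content_type.startswith("text/html"):
        return None
    # Assumes a tag of the form <meta http-equiv="refresh" content="0;URL=...">
    match = re.search(
        rb'http-equiv="refresh"\s+content="\d+;\s*URL=([^"]+)"',
        body,
        re.IGNORECASE,
    )
    target = match.group(1).decode() if match else None
    return f"HTTP {status_code}, {content_type or 'unknown type'}, redirect target: {target}"
```

With requests, it could be called as `describe_failure(r.status_code, r.headers, r.content)` before copying the stream to a file, so that an HTML error page is not saved as `track1.mp3`.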

West
  • duplicate of https://stackoverflow.com/questions/49087990/python-request-being-blocked-by-cloudflare/ – Sawant Sharma Oct 07 '19 at 12:56
  • Cloudflare is very, very, very protective against scraping – Maurice Meyer Oct 07 '19 at 14:35
  • Everything here seems to be working as designed. Do you have permission to scrape this site? If not, then don't do it. – Michael - sqlbot Oct 07 '19 at 23:47
  • Honestly why do people just downvote for no reason?:( – West Oct 07 '19 at 23:48
  • @Michael-sqlbot yes, I have permission, as it's the only way I can get the CloudFront URL. It's a paid site. – West Oct 07 '19 at 23:50
  • To anyone who might think I'm doing something illegal: I'm not. I have permission to download the files because I have paid for the service; using Python just makes my life easier. I'm doing this with Selenium anyway, but obviously something like requests is better – West Oct 07 '19 at 23:53
  • @Saawant how is this a duplicate? Your link is for Cloudflare, and that's not the same thing as CloudFront – West Oct 07 '19 at 23:59
  • @West, this might be relevant: https://stackoverflow.com/questions/6549787/getting-started-with-secure-aws-cloudfront-streaming-with-python – jDo Oct 08 '19 at 00:18
  • @jDo Thanks for the link, it seems like there's no way of getting around this without interacting with a browser – West Oct 08 '19 at 00:30
  • Maybe it can be done using boto but I haven't looked into it. Anyway, I noticed that the signed URL in your question contains `&Expires=1570451804` which translates to `datetime.datetime(2019, 10, 7, 12, 36, 44)` UTC. Perhaps the URL has simply expired? – jDo Oct 08 '19 at 00:43
  • @jDo Thanks for your help! Your link had me researching more about how CloudFront works and helped me come up with a solution using requests. The timestamp was just from an example from yesterday, but yes, the expiry window is quite short, only about 20 seconds from when the request is made. Cheers:) – West Oct 08 '19 at 04:40
  • The CloudFront URL I was using before wasn't actually signed, so I just had to find a way of getting a signed one, and after that everything worked – West Oct 08 '19 at 04:44
  • @West, nice one! You're welcome! – jDo Oct 08 '19 at 07:34
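
The expiry check discussed in the comments can be sketched as below. This is an illustration only: it assumes the signed URL carries an `Expires` query parameter holding a Unix timestamp, as the URL in the question does. The function names `signed_url_expiry` and `is_expired` are hypothetical, and the example URL is the question's URL shortened for readability (the signature is omitted, so it is not a working link).

```python
from datetime import datetime, timezone
from urllib.parse import urlparse, parse_qs

def signed_url_expiry(url):
    """Return the URL's Expires timestamp as an aware UTC datetime, or None."""
    values = parse_qs(urlparse(url).query).get("Expires")
    if not values:
        return None
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)

def is_expired(url, now=None):
    """True if the URL has an Expires parameter that is not in the future."""
    expiry = signed_url_expiry(url)
    if expiry is None:
        return False  # nothing to check; treat as not expired
    now = now or datetime.now(timezone.utc)
    return now >= expiry

# Shortened version of the URL from the question (signature omitted).
url = ("https://d2mgevdyeotxc9.cloudfront.net/6/7/6/7/7/zohujeaftuoduhba.mp3"
       "?Expires=1570451804&Key-Pair-Id=APKAJLJPFMSLJDIGQDHA")
print(signed_url_expiry(url))  # 2019-10-07 12:36:44+00:00
```

Since the expiry is reported to be only about 20 seconds after the URL is issued, a check like `is_expired(url)` right before the download can show whether a fresh signed URL needs to be fetched first.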
