4

I'm using Selenium to scrape a bunch of files that are provided in a mix of formats and styles, and I'm trying to handle both HTML and PDF. I've come across an issue when the target of a link is a PDF file but the link itself does not contain '.pdf', e.g., and (note that one automatically downloads and one just displays the file, at least in Chrome, so there may need to be a test for two different types of PDF targets as well?)

Is there a way to tell programmatically whether the target of a link is a PDF that is more intelligent than just checking if it ends in .pdf?

I can't just download the file no matter the content, because I have distinct handling for the html files, where I want to follow secondary links and see if I can find pdfs, which won't work if the target is a pdf directly.

ETA: The accepted answer worked perfectly - the linked potential dupe is for testing on file system, not for download so I don't think that's valid, and certainly the answer below is better for this situation.

gkennos
  • Please check this [URL](http://stackoverflow.com/help) it will be useful to lift your content quality up – Willie Cheng May 20 '16 at 01:18
  • http://stackoverflow.com/questions/10937350/how-to-check-type-of-files-without-extensions-in-python could help resolve your problem – glls May 20 '16 at 01:25
  • @willie I don't understand what you think I've done wrong here? Happy to edit my question if required, but it's not clear to me what has been missed from the posting guidelines. – gkennos May 20 '16 at 02:28

1 Answer

4

Selenium (or Chrome) checks the 'Content-Type' header and chooses what to do. You can check the 'Content-Type' of a URL yourself with requests, like this:

>>> import pprint
>>> import requests
>>> r = requests.head('https://resus.org.au/?wpfb_dl=17')
>>> pprint.pprint(dict(r.headers))
{'Accept-Ranges': 'bytes',
 'Age': '8518',
 'Cache-Control': 'no-cache, must-revalidate, max-age=0',
 'Connection': 'keep-alive',
 'Content-Description': 'File Transfer',
 'Content-Disposition': 'attachment; '
                        'filename="anzcor-guideline-6-compressions-apr-2021.pdf"',
 'Content-Length': '535677',
 'Content-Md5': '90AUQUZu0vFGJ7cBPvRxcg==',
 'Content-Security-Policy': 'upgrade-insecure-requests',
 'Content-Type': 'application/pdf',
 'Date': 'Wed, 19 Jan 2022 11:20:06 GMT',
 'Expires': 'Wed, 11 Jan 1984 05:00:00 GMT',
 'Last-Modified': 'Wed, 19 Jan 2022 08:58:08 GMT',
 'Pragma': 'no-cache',
 'Server': 'openresty',
 'Strict-Transport-Security': 'max-age=300, max-age=31536000; '
                              'includeSubDomains',
 'Vary': 'User-Agent',
 'X-Backend': 'local',
 'X-Cache': 'cached',
 'X-Cache-Hit': 'HIT',
 'X-Cacheable': 'YES:Forced',
 'X-Content-Type-Options': 'nosniff',
 'X-Xss-Protection': '1; mode=block'}

As you can see, the 'Content-Type' of both of your links is 'application/pdf':

>>> r.headers['Content-Type']
'application/pdf'

So you can just check the output of requests.head(link).headers['Content-Type'], and do whatever you need.
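As a rough sketch of how that check could feed the separate HTML and PDF handling from the question: the helpers below split a link into 'pdf', 'html', or 'other' by its 'Content-Type', and read 'Content-Disposition' to tell a forced download from an inline display. The function names (`media_type`, `classify_headers`, `is_attachment`, `classify_link`) and the 10-second timeout are my own assumptions for illustration, not part of requests.

```python
def media_type(content_type):
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return (content_type or "").split(";", 1)[0].strip().lower()


def classify_headers(headers):
    """Return 'pdf', 'html', or 'other' from a mapping of response headers.

    requests' headers mapping is case-insensitive; with a plain dict the
    key must be spelled exactly 'Content-Type'.
    """
    kind = media_type(headers.get("Content-Type"))
    if kind == "application/pdf":
        return "pdf"
    if kind in ("text/html", "application/xhtml+xml"):
        return "html"
    return "other"


def is_attachment(headers):
    """True when the server asks the browser to download rather than display."""
    disposition = headers.get("Content-Disposition") or ""
    return disposition.strip().lower().startswith("attachment")


def classify_link(url, timeout=10):
    """Send a HEAD request and classify the target (needs network access)."""
    import requests  # third-party; imported lazily so the helpers above work without it

    response = requests.head(url, allow_redirects=True, timeout=timeout)
    response.raise_for_status()
    return classify_headers(response.headers)
```

With this, `classify_link(link) == 'pdf'` would route a link to the download path and `'html'` to the follow-the-secondary-links path. Some servers reject or mishandle HEAD requests; in that case `requests.get(url, stream=True)` is one alternative, since it defers downloading the body until it is read.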


As of Jan 19 2022, the first link in your question redirects me to a 404 page. The second one is still accessible, but it has to be requested over HTTPS, by changing the start of the link from http:// to https://.

In any case, as long as the URL doesn't redirect to another page, this answer still applies. If it does redirect, check whether the status_code is 301 and request the new URL instead:

>>> r = requests.head('http://resus.org.au/?wpfb_dl=17')
>>> r.status_code
301
>>> r = requests.head('https://resus.org.au/?wpfb_dl=17')
>>> r.status_code
200
>>>
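A possible way to avoid editing the URL by hand: `requests.head()` does not follow redirects unless `allow_redirects=True` is passed, which is why the session above stops at the 301. The sketch below follows the redirects and reports each hop from `response.history`. The names `redirect_chain` and `final_content_type` and the timeout value are hypothetical, chosen only for this example.

```python
def redirect_chain(response):
    """List (status_code, url) for each hop, ending with the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def final_content_type(url, timeout=10):
    """Follow redirects with HEAD; return (final_url, content_type, hops)."""
    import requests  # third-party; imported here so redirect_chain() needs no install

    # HEAD requests only follow redirects when asked to explicitly.
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    content_type = response.headers.get("Content-Type")
    return response.url, content_type, redirect_chain(response)
```

Calling `final_content_type('http://resus.org.au/?wpfb_dl=17')` should then report the header of the page the redirect ends on rather than the header of the 301 response itself, though what a given server returns can change over time.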
Remi Guan
  • Well, I just reproduced the code you listed above and it gives me 'Content-Type': 'text/html;' for both links not 'application/pdf'. Any ideas what changed or what to change? I'd greatly appreciate this. – Kiryl A. Jan 19 '22 at 09:03
  • @KirylAleksandrovich I've tried it again, and yes, you're right. After many years, the links now return the header `'X-Cacheable': 'NO:HTTPS Redirect'`. This means the link should start with `https://` rather than `http://`. I've edited my code, please try again with the new code above. – Remi Guan Jan 19 '22 at 11:22
  • Yeah, the new version works fine, thanks! – Kiryl A. Jan 19 '22 at 11:35