I am trying to write a Python program that is able to extract a PDF file that is embedded in a website, e.g., in a PDF viewer. However, I haven't yet been able to find a robust way to accomplish this.

Is there a way or best practice to identify PDFs, perhaps based on MIME type?

waka
  • [mime-type](https://stackoverflow.com/a/312258/6689249) is `application/pdf` – aiven Jan 07 '18 at 23:58
  • Hello Aiven. Thanks for your reply. But in this case, how can you identify the mime-type if the content is embedded? –  Jan 08 '18 at 00:01
  • There are also [some](https://stackoverflow.com/a/26230781/6689249) suggestions on how to download a PDF. Can you provide an example with an embedded PDF (a site link, maybe)? – aiven Jan 08 '18 at 00:05
  • Of course, I just went online to find a random webpage that includes an embedded pdf: https://issuu.com/futurepublishing/docs/art274.issuu Of course they have a download link here, but that's not the aim. It's really about how to identify that there's a pdf embedded. :) –  Jan 08 '18 at 00:07

1 Answer

Basically, you need to search the HTML page for an `iframe` and check its `src` attribute, which should contain the URL of the PDF file.

For example, https://pdfobject.com/examples/pdfjs-forced.html contains: `<iframe src="/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf" style="border: none; width: 100%; height: 100%;" frameborder="0"></iframe>`

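So the viewer URL is https://pdfobject.com/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf, and the PDF itself is named by the URL-encoded `file` query parameter, which decodes to `/pdf/sample-3pp.pdf`.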
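As a rough sketch of the idea above, the following uses only the standard library to collect embed-style tags and turn them into candidate PDF URLs. The names `EmbedFinder` and `candidate_pdf_urls` are made up for illustration, and looking at `<embed>` and `<object>` in addition to `<iframe>`, as well as treating a `file` query parameter as the document location, are assumptions that fit PDF.js-style viewers but will not hold for every site.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlsplit


class EmbedFinder(HTMLParser):
    """Collect the URL attribute of tags commonly used to embed documents."""

    # Assumption: besides <iframe>, a page may use <embed> or <object>.
    ATTRS = {"iframe": "src", "embed": "src", "object": "data"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ATTRS.get(tag)
        if wanted is None:
            return
        value = dict(attrs).get(wanted)
        if value:
            self.urls.append(value)


def candidate_pdf_urls(html, page_url):
    """Return absolute URLs that plausibly point at an embedded PDF."""
    finder = EmbedFinder()
    finder.feed(html)
    results = []
    for raw in finder.urls:
        absolute = urljoin(page_url, raw)
        parts = urlsplit(absolute)
        # Assumption: the viewer passes the document in a "file" parameter.
        file_values = parse_qs(parts.query).get("file")
        if file_values:
            results.append(urljoin(absolute, file_values[0]))
        elif parts.path.lower().endswith(".pdf"):
            results.append(absolute)
    return results
```

Called with the iframe from the pdfobject.com example and that page's URL, this resolves the `file` parameter against the site's host. Pages that load the document through JavaScript without any such tag will return an empty list.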

Note that not every web PDF reader exposes the location of the file. For example, the site that you've shared doesn't.

You can load the HTML page with urllib or requests and search for the HTML tag with BeautifulSoup, or use Scrapy or any of the many other tools available.
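Once you have a candidate URL, one way to confirm that it really is a PDF is to look at the `Content-Type` header and at the first bytes of the body, since PDF files begin with `%PDF-`. This is only a minimal sketch: `looks_like_pdf` and `url_is_pdf` are hypothetical helper names, and it assumes the server answers a plain GET from urllib without cookies, JavaScript, or special headers.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"


def looks_like_pdf(content_type, first_bytes):
    """Return True if the header or the leading bytes indicate a PDF."""
    # Drop parameters such as "; charset=..." before comparing.
    media_type = (content_type or "").split(";")[0].strip().lower()
    return media_type == "application/pdf" or first_bytes.startswith(PDF_MAGIC)


def url_is_pdf(url, timeout=10):
    """Read only the start of the resource and classify it (needs network)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
        return looks_like_pdf(content_type, response.read(len(PDF_MAGIC)))
```

Checking the leading bytes as well as the header helps with servers that send PDFs as `application/octet-stream`.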

aiven