Extract embedded pdf document from a webpage

Question

I am trying to write a Python program that is able to extract a PDF file that is embedded in a website, e.g., in a PDF viewer. However, I haven't yet been able to find a robust way to accomplish this.

Is there a way or best practice to identify PDFs based on MIME-type maybe?

[mime-type](https://stackoverflow.com/a/312258/6689249) is `application/pdf` — aiven, Jan 07 '18 at 23:58
Hello Aiven. Thanks for your reply. But in this case, how can you identify the mime-type if the content is embedded? — , Jan 08 '18 at 00:01
There are also [some](https://stackoverflow.com/a/26230781/6689249) suggestions on how to download a PDF. Can you provide an example with an embedded PDF (a site link, maybe)? — aiven, Jan 08 '18 at 00:05
Of course, I just went online to find a random webpage that includes an embedded pdf: https://issuu.com/futurepublishing/docs/art274.issuu Of course they have a download link here, but that's not the aim. It's really about how to identify that there's a pdf embedded. :) — , Jan 08 '18 at 00:07

score -1 · Answer 1 · answered Jan 08 '18 at 00:16

Basically, you need to search the HTML page for an iframe and check its src attribute, which should contain the URL of the PDF file.

For example, https://pdfobject.com/examples/pdfjs-forced.html contains: `<iframe src="/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf" style="border: none; width: 100%; height: 100%;" frameborder="0"></iframe>`

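So the viewer URL is https://pdfobject.com/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf, and the PDF itself is named by the `file` parameter, which decodes to /pdf/sample-3pp.pdf, i.e. https://pdfobject.com/pdf/sample-3pp.pdf.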
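Here is a rough sketch of how that URL could be pulled apart with the standard library. The function name `pdf_url_from_viewer` and the default parameter name `file` are my own choices for illustration; `file` is what the PDF.js viewer in this example uses, and other viewers may use a different parameter or none at all.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def pdf_url_from_viewer(page_url, iframe_src, param="file"):
    """Return the absolute URL named by the viewer's query parameter, or None.

    `param` is an assumption: PDF.js-style viewers use "file",
    other viewers may use something else.
    """
    # Resolve the (possibly relative) iframe src against the page URL.
    viewer_url = urljoin(page_url, iframe_src)
    # parse_qs percent-decodes the value, so %2Fpdf%2F... becomes /pdf/...
    values = parse_qs(urlparse(viewer_url).query).get(param)
    if not values:
        return None
    # The decoded value may itself be relative to the viewer URL.
    return urljoin(viewer_url, values[0])


if __name__ == "__main__":
    print(pdf_url_from_viewer(
        "https://pdfobject.com/examples/pdfjs-forced.html",
        "/pdfjs/web/viewer.html?file=%2Fpdf%2Fsample-3pp.pdf",
    ))
```

If the iframe src has no such parameter the function returns None, which is a hint that the viewer loads the document some other way.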

Note that not every web-based PDF reader exposes the location of the file. For example, the site that you've shared doesn't.

You can load the HTML page with urllib or requests and search for the tag with BeautifulSoup, or use Scrapy or one of the many other tools available.
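A minimal sketch of the search step, using only the standard library's `html.parser` in place of BeautifulSoup so it runs without extra installs. The names `EmbedFinder`, `find_embeds` and `looks_like_pdf` are made up for this example. Looking at `embed` and `object` tags as well as `iframe` is my assumption about where embedded documents commonly sit. The MIME-type check relies on `application/pdf` (as mentioned in the comments) plus the fact that PDF files start with the bytes `%PDF-`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbedFinder(HTMLParser):
    """Collect URLs from tags that are commonly used to embed documents."""

    # Tag name -> attribute that holds the URL (an assumed, non-exhaustive list).
    URL_ATTRS = {"iframe": "src", "embed": "src", "object": "data"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.URL_ATTRS.get(tag)
        if attr_name is None:
            return
        value = dict(attrs).get(attr_name)
        if value:
            # Make relative URLs absolute against the page URL.
            self.urls.append(urljoin(self.base_url, value))


def find_embeds(html, base_url):
    """Return absolute URLs of iframe/embed/object targets found in `html`."""
    finder = EmbedFinder(base_url)
    finder.feed(html)
    return finder.urls


def looks_like_pdf(content_type, first_bytes):
    """Guess whether a response is a PDF from its Content-Type or leading bytes."""
    mime = (content_type or "").split(";")[0].strip().lower()
    return mime == "application/pdf" or first_bytes.startswith(b"%PDF-")
```

With requests you would download the page, pass `response.text` and the page URL to `find_embeds`, then fetch each candidate and call `looks_like_pdf(r.headers.get("Content-Type"), r.content[:5])`. This only sees what is in the static HTML, so viewers that load the document through JavaScript will not show up.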
