
I am trying to scrape this page recursively using BeautifulSoup.

The problem, however, is that the PDF links actually open a new page in which the PDFs are embedded. On that embedded page we can then find the true PDF links inside the embed tag.

I therefore added a line to check whether the content is of type application/pdf. However, using the redirect URL, I am unable to extract the PDF links from this new page with the embedded PDF.
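
For the content-type check itself, this is roughly what I mean (a minimal sketch; looks_like_pdf is just an illustrative name, and it assumes the server answers HEAD requests):

import requests

def looks_like_pdf(url):
    """Peek at the Content-Type header without downloading the whole body."""
    r = requests.head(url, allow_redirects=True)
    return r.headers.get('content-type', '').startswith('application/pdf')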

I tried the following, but it did not work (a valid PDF link is never found):

# run the following in a .py file:
# spider = fdb.OurSpider()
# spider.scrape_page(url=url)

import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from requests import get

import time

MAX_DEPTH = 10

class OurSpider:

    def __init__(self):
        """Init our Custom Spider"""

    def scrape_page(self, url):
        """Scrape page"""

        try:
            self.download_pdfs(url=url)

        except requests.exceptions.MissingSchema:
            print(f'skipped MissingSchema [{url}]')

            try:
                links = self.get_links(url=url)
                print(links)
            except Exception as exc:
                print(f'failed to get links: {exc}')

    def download_pdfs(self, url, depth=1):
        # If there is no such folder, the script will create one automatically
        print('')
        print(f'--- [{depth}] {url}')
        if depth > MAX_DEPTH:
            return 'max depth reached'

        soup = self.get_soup(url=url)
        links = soup.select("a[href$='.pdf']")

        for link in links:
            try:
                full_url = urljoin(url, link['href'])
                content = get(full_url)
                if content.status_code == 200 and content.headers['content-type'] == 'application/pdf':
                    self.download_pdf(full_url=full_url)

                elif full_url != url:
                    self.download_pdfs(url=full_url, depth=depth+1)

                else:
                    print('skipping url')

            except requests.exceptions.InvalidSchema:
                print(f'skipped InvalidSchema [{link}]')

        print('--- downloading pdfs done')

    def download_pdf(self, full_url):
        """Download single url"""

        os.makedirs('tmp', exist_ok=True)  # create the tmp/ folder if it does not exist yet
        filename = "".join(['tmp/', str(round(time.time() * 1000)), '.pdf'])
        if not self.file_exists(filename=filename):

            print(f'{filename}: {full_url}')
            with open(filename, 'wb') as f:
                f.write(requests.get(full_url).content)

    def get_links(self, url):
        """Get the links given the url"""
        soup = self.get_soup(url=url)
        return soup.findAll('a', href=True)

    @staticmethod
    def file_exists(filename):
        """File exists locally"""
        return os.path.exists(filename)

    @staticmethod
    def get_soup(url):
        """Init the url"""
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "html.parser")
        return soup
WJA
  • Seems unclear to me: now I'm on the main site; after I click on the first `pdf` file such as `Investor A`, what then? – αԋɱҽԃ αмєяιcαη Apr 02 '20 at 16:09
  • Click on the Annual Report PDF of Investor A. On the page that opens you'll have 5 documents embedded. – WJA Apr 02 '20 at 16:12
  • "you'll have 5 documents embedded." -- I only see one PDF in an iframe tag which I can find using the CSS selector `div.iframeContainer iframe` –  Apr 02 '20 at 16:20
  • Yes correct, that is the one. – WJA Apr 02 '20 at 16:20
  • See https://stackoverflow.com/questions/5041008/how-to-find-elements-by-class to search for the div by class. Relevant BeautifulSoup [docs here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class) –  Apr 02 '20 at 16:26
  • Yes, but that is not really a solution: what if you don't know beforehand that the page is structured like this? I am looking for an automated way to extract all PDFs, not custom code that only works on this page. – WJA Apr 02 '20 at 16:27
  • @JohnAndrews I've an idea, hold on – αԋɱҽԃ αмєяιcαη Apr 02 '20 at 17:32
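
To make the comments above concrete, a minimal sketch of pulling the embedded document out of the viewer page with the `div.iframeContainer iframe` selector mentioned there (embedded_pdf_url is a hypothetical helper; everything beyond the selector is an assumption about the page):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def embedded_pdf_url(viewer_url):
    """Return the absolute src of the PDF iframe on the viewer page, if present."""
    r = requests.get(viewer_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    iframe = soup.select_one('div.iframeContainer iframe')  # selector from the comments above
    return urljoin(r.url, iframe['src']) if iframe else None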

1 Answer

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor, as_completed
import re
from urllib.parse import unquote

site = "https://www.masked.com/us/individual/resources/regulatory-documents/mutual-funds"


def main(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    target = [f"{url[:25]}{item.get('href')}"
              for item in soup.findAll("a", title="Annual Report")]
    return target


def parse(url):
    with requests.Session() as req:
        r = req.get(url)
        # the real PDF location sits behind the Override=... parameter on the viewer page
        match = [unquote(f"{r.url[:25]}{match.group(1)}") for match in re.finditer(
            r"Override=(.+?)\"", r.text)]
        return match


with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(parse, url) for url in main(site)]

links = []
for future in futures:
    links.extend(future.result())

print(f"Collected {len(links)}")


def download(url):
    with requests.Session() as req:
        r = req.get(url)
        if r.status_code == 200 and r.headers['Content-Type'] == "application/pdf;charset=UTF-8":
            name = r.url[r.url.rfind("/") + 1:]
            with open(name, 'wb') as f:  # save the PDF before reporting it
                f.write(r.content)
            return f"Saving {name}"


with ThreadPoolExecutor(max_workers=50) as executor:
    futures = [executor.submit(download, url) for url in links]

for future in as_completed(futures):
    print(future.result())

  • Looks good, do you mind just hiding the URL in the answer, for regulatory purposes? I will try your code tomorrow. – WJA Apr 02 '20 at 19:13
  • Wow, that actually works :) What exactly did you do to make this work, and what do you think was the problem? – WJA Apr 03 '20 at 07:46
  • @JohnAndrews Just used threads to collect all the links, then parsed the `Override` parameter with regex, since it holds the real location behind the link. – αԋɱҽԃ αмєяιcαη Apr 03 '20 at 09:27
  • But would it work on other sites? Or is the `Override` parameter specific to this site's structure? – WJA Apr 03 '20 at 09:44
  • @JohnAndrews Your question goes beyond knowledge of this particular site's structure. Basically, the `Override` parameter handles a redirection to another location within the same host. Each host has its own case; for this particular one, we catch the final destination source. For other sites, you would allow redirects and, if there is a redirection, catch the final destination (see the sketch after these comments). – αԋɱҽԃ αмєяιcαη Apr 03 '20 at 09:46
  • Are there any libraries that can handle any type of site, instead of having to write all the cases myself? – WJA Apr 03 '20 at 13:48
  • @JohnAndrews That can't be done in general. What if the host redirects you to a captcha? What if it asks for input before the download, or requires a click? You could only handle cases like that with a `Machine Learning` model. – αԋɱҽԃ αмєяιcαη Apr 03 '20 at 13:53
  • Any reading material on this? What should I look for on the web to handle the most common cases, and what is this called? – WJA Apr 03 '20 at 13:58
  • @αԋɱҽԃαмєяιcαη obsessed with your background. – El_1988 Nov 19 '20 at 00:32
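
As a rough illustration of the generic approach described in the comments above (allow redirects, then catch the final destination), a minimal sketch; follow_redirects is a hypothetical helper and not part of the answer:

import requests

def follow_redirects(url):
    """Allow redirects, print each hop, and return where the request finally landed."""
    r = requests.get(url, allow_redirects=True)
    for hop in r.history:  # intermediate redirect responses, if any
        print(f"{hop.status_code} -> {hop.headers.get('Location')}")
    print(f"final destination: {r.url} ({r.headers.get('Content-Type')})")
    return r.url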