
I'm trying to download journal issues from a website (http://cis-ca.org/islamscience1.php). I ran a script to get all the PDFs on this page. However, these PDFs contain links that point to other PDFs.

I want to get the terminal articles from all the PDF links.

This is how I got all the PDFs from the page http://cis-ca.org/islamscience1.php:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://cis-ca.org/islamscience1.php"

# If there is no such folder, the script will create one automatically
folder_location = r'webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the PDF files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

I'd like to get the articles linked inside these PDFs. Thanks in advance.

DrFahizzle
  • Might already have an answer here: https://stackoverflow.com/q/27744210/10058326 – Kunj Mehta Jun 14 '19 at 04:38
  • Possible duplicate of [Extract hyperlinks from PDF in Python](https://stackoverflow.com/questions/27744210/extract-hyperlinks-from-pdf-in-python) – bharatk Jun 14 '19 at 04:39
  • I was hoping some automation of the whole process instead of going through each file. – DrFahizzle Jun 14 '19 at 04:42

1 Answer


https://mamclain.com/?page=Blog_Programing_Python_Removing_PDF_Hyperlinks_With_Python

Take a look at this link. It shows how to identify the hyperlinks in a PDF and sanitize them out of the document. You could follow it up to the identification part and then store each hyperlink instead of stripping it.
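Here's a minimal sketch of that idea using PyPDF2's 1.x API: it walks each page's /Annots array and collects the /URI of every link action, then fetches anything that looks like a direct PDF link. The function name, the download loop, and the assumption that the terminal articles are plain .pdf URLs are mine, not from the blog post, so treat this as a starting point:

import os

import PyPDF2
import requests

def extract_pdf_links(path):
    """Return every /URI link annotation found in the PDF at `path`."""
    links = []
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        for page_num in range(reader.getNumPages()):
            annotations = reader.getPage(page_num).get('/Annots')
            if annotations is None:
                continue
            for annot in annotations:
                obj = annot.getObject()
                # Link annotations keep their target inside an /A action dictionary
                action = obj.get('/A')
                if action is not None and '/URI' in action:
                    links.append(action['/URI'])
    return links

# Walk the PDFs your script already downloaded and fetch every linked PDF
folder_location = r'webscraping'
for name in os.listdir(folder_location):
    if not name.lower().endswith('.pdf'):
        continue
    for uri in extract_pdf_links(os.path.join(folder_location, name)):
        if uri.lower().endswith('.pdf'):
            target = os.path.join(folder_location, uri.split('/')[-1])
            with open(target, 'wb') as f:
                f.write(requests.get(uri).content)

If some links are relative, or point to landing pages rather than the PDFs themselves, you'd need an extra resolution step before downloading.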

Alternatively, take a look at this library: https://github.com/metachris/pdfx
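If I remember its README correctly, pdfx gets you the same result in a few lines; the filename below is just a placeholder for one of your downloaded issues:

import pdfx

pdf = pdfx.PDFx("webscraping/some-issue.pdf")  # placeholder path
references = pdf.get_references_as_dict()      # keys like 'pdf' and 'url'
print(references.get('pdf', []))

# pdfx can also download every PDF it found in one call
pdf.download_pdfs("webscraping/terminal-articles")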

Kunj Mehta