
I'm trying to download journal issues from a website (http://cis-ca.org/islamscience1.php). I ran a script to get all the PDFs on this page. However, these PDFs contain links that point to other PDFs.

I want to get the terminal articles from all the PDF links.

This is how I got all the PDFs from the page http://cis-ca.org/islamscience1.php:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://cis-ca.org/islamscience1.php"

# If there is no such folder, the script will create one automatically
folder_location = r'webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the PDF files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

I'd like to get the articles linked inside these PDFs. Thanks in advance.

DrFahizzle
  • Might already have an answer here: https://stackoverflow.com/q/27744210/10058326 – Kunj Mehta Jun 14 '19 at 04:38
  • Possible duplicate of [Extract hyperlinks from PDF in Python](https://stackoverflow.com/questions/27744210/extract-hyperlinks-from-pdf-in-python) – bharatk Jun 14 '19 at 04:39
  • I was hoping some automation of the whole process instead of going through each file. – DrFahizzle Jun 14 '19 at 04:42

1 Answer


https://mamclain.com/?page=Blog_Programing_Python_Removing_PDF_Hyperlinks_With_Python

Take a look at this link. It shows how to identify the hyperlinks in a PDF and sanitize them out of the document. You could follow it up to the identification part and then store each hyperlink instead of stripping it.
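Here's a minimal sketch of that idea using PyPDF2's 1.x API: it walks each page's /Annots array and collects the /URI of every link action, then fetches anything that looks like a direct PDF link. The function name, the download loop, and the assumption that the terminal articles are plain .pdf URLs are mine, not from the blog post, so treat this as a starting point:

import os

import PyPDF2
import requests

def extract_pdf_links(path):
    """Return every /URI link annotation found in the PDF at `path`."""
    links = []
    with open(path, 'rb') as f:
        reader = PyPDF2.PdfFileReader(f)
        for page_num in range(reader.getNumPages()):
            annotations = reader.getPage(page_num).get('/Annots')
            if annotations is None:
                continue
            for annot in annotations:
                obj = annot.getObject()
                # Link annotations keep their target inside an /A action dictionary
                action = obj.get('/A')
                if action is not None and '/URI' in action:
                    links.append(action['/URI'])
    return links

# Walk the PDFs your script already downloaded and fetch every linked PDF
folder_location = r'webscraping'
for name in os.listdir(folder_location):
    if not name.lower().endswith('.pdf'):
        continue
    for uri in extract_pdf_links(os.path.join(folder_location, name)):
        if uri.lower().endswith('.pdf'):
            target = os.path.join(folder_location, uri.split('/')[-1])
            with open(target, 'wb') as f:
                f.write(requests.get(uri).content)

If some links are relative, or point to landing pages rather than the PDFs themselves, you'd need an extra resolution step before downloading.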

Alternatively, take a look at this library: https://github.com/metachris/pdfx
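If I remember its README correctly, pdfx gets you the same result in a few lines; the filename below is just a placeholder for one of your downloaded issues:

import pdfx

pdf = pdfx.PDFx("webscraping/some-issue.pdf")  # placeholder path
references = pdf.get_references_as_dict()      # keys like 'pdf' and 'url'
print(references.get('pdf', []))

# pdfx can also download every PDF it found in one call
pdf.download_pdfs("webscraping/terminal-articles")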

Kunj Mehta