
I was thinking of using BeautifulSoup, but I'm not that good at it.

Basically, this is the page with Egyptian translations:

https://mjn.host.cs.st-andrews.ac.uk/egyptian/texts/corpus/pdf/

It's very basic: there are many links to PDFs, and each link has a name.

Since the PDFs themselves are named with a jumble of numbers, I'd like to attach the correct name to each PDF (i.e. the link name; not sure whether I can leave the commas in the names, though).

Thanks in advance!

EDIT: Forgot to add my code:

import os, requests, bs4

url = 'https://mjn.host.cs.st-andrews.ac.uk/egyptian/texts/corpus/pdf/'


os.makedirs('Egypt', exist_ok=True)
res = requests.get(url)
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, 'html.parser')

document = soup.select('')  # I do not know what selector to use here

name = ''  # I do not know how to retrieve the link name

docFile = open(os.path.join('Egypt', name), 'wb')
for chunk in res.iter_content(100000):
    docFile.write(chunk)
docFile.close()
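
To be clear about the goal, something along these lines is what I'm imagining (I'm guessing at the selector syntax and at how to read a link's text, so this is a sketch of the intent, not working code):

for link in soup.select("a[href$='.pdf']"):  # every link that points at a PDF?
    name = link.get_text(strip=True)         # the visible link text?
    print(name, link['href'])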
user3610033

1 Answer


Code is by SIM from this link: Download all pdf files from a website using Python

Amended to name the files from the text descriptions in each PDF link.

Enjoy!

Edit: updated to remove unwanted characters from filenames.

# By SIM: https://stackoverflow.com/questions/54616638/download-all-pdf-files-from-a-website-using-python
import os
import re  # for sanitizing filenames
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://mjn.host.cs.st-andrews.ac.uk/egyptian/texts/corpus/pdf/"

# Your download folder goes here...
folder_location = "/Users/roger/Downloads"

# If there is no such folder, the script will create one automatically
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
count = 0
for link in soup.select("a[href$='.pdf']"):
    # The original named each file from the last portion of its href,
    # which is unique on this page:
    # filename = os.path.join(folder_location, link['href'].split('/')[-1])
    # This version uses the link description instead, stripped of any
    # characters that are unsafe in filenames:
    s = re.sub(r'[^0-9a-zA-Z\[\]]+', ' ', link.string)
    filename = os.path.join(folder_location, s + ".pdf")
    print(filename)
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
    count += 1
print(f'Downloaded {count} files.')
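
One caveat with naming files from their link descriptions: the substitution above maps any two descriptions that differ only in punctuation to the same name, and open(filename, 'wb') silently overwrites an earlier file of the same name, so the final file count can end up lower than the number of links. A minimal sketch of one way around that, assuming you want to keep both files rather than overwrite (the helper name is mine):

def unique_path(folder, stem, ext=".pdf"):
    # Append a counter when a file with the same cleaned name already exists.
    candidate = os.path.join(folder, stem + ext)
    n = 1
    while os.path.exists(candidate):
        candidate = os.path.join(folder, f"{stem} ({n}){ext}")
        n += 1
    return candidate

Inside the loop, filename = unique_path(folder_location, s) would then replace the os.path.join line.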
Roger
  • For some reason it only saves me 92 out of around 325 files. I need to fix something I guess! Thanks by the way! – user3610033 May 14 '21 at 16:12
  • On my Mac the script downloads all 325 files. On other platforms, you may have to process the "link.string" to remove certain characters, e.g: link.string.replace(':','-') – Roger May 14 '21 at 17:06
  • There's a long discussion of the issues involved, with various solutions, here: https://stackoverflow.com/questions/295135/turn-a-string-into-a-valid-filename – Roger May 14 '21 at 17:07
  • Thanks, I'll give it a go! – user3610033 May 14 '21 at 19:22
  • I've tried removing the characters, but apparently that's not the problem. It stops before the 'books' section, no idea why! – user3610033 May 14 '21 at 22:17
  • What platform are you using? – Roger May 15 '21 at 05:33
  • Try the amended version above. It should be easy for you to tweak the filenames to make it work for your system. – Roger May 15 '21 at 13:08
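
For reference, a portable cleanup helper along the lines of the "valid filename" discussion linked above might look something like this (a sketch only; the exact character set and length limit are assumptions, not from the thread):

import re

def safe_filename(text, maxlen=120):
    # Replace characters that are commonly illegal or troublesome in
    # filenames on Windows/macOS; the exact set is an assumption.
    cleaned = re.sub(r'[<>:"/\\|?*]', '-', text)
    cleaned = ' '.join(cleaned.split())   # collapse runs of whitespace
    return cleaned[:maxlen].rstrip(' .')  # avoid trailing spaces/dots

In the loop above, filename = os.path.join(folder_location, safe_filename(link.string) + ".pdf") would then be the cross-platform variant.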