9

I'm working on making a PDF Web Scraper in Python. Essentially, I'm trying to scrape all of the lecture notes from one of my courses, which are in the form of PDFs. I want to enter a url, and then get the PDFs and save them in a directory in my laptop. I've looked at several tutorials, but I'm not entirely sure how to go about doing this. None of the questions on StackOverflow seem to be helping me either.

Here is what I have so far:

import requests
from bs4 import BeautifulSoup
import shutil

bs = BeautifulSoup

url = input("Enter the URL you want to scrape from: ")
print("")

suffix = ".pdf"

link_list = []

def getPDFs():    
    # Gets URL from user to scrape
    response = requests.get(url, stream=True)
    soup = bs(response.text)

    #for link in soup.find_all('a'): # Finds all links
     #   if suffix in str(link): # If the link ends in .pdf
      #      link_list.append(link.get('href'))
    #print(link_list)

    with open('CS112.Lecture.09.pdf', 'wb') as out_file:
        shutil.copyfileobj(response.raw, out_file)
    del response
    print("PDF Saved")

getPDFs()

Originally, I had gotten all of the links to the PDFs, but did not know how to download them; the code for that is now commented out.

Now I've gotten to the point where I'm trying to download just one PDF; and a PDF does get downloaded, but it's a 0KB file.

If it's of any use, I'm using Python 3.4.2

alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
freddiev4
  • 2,501
  • 2
  • 26
  • 46

1 Answers1

13

If this is something that does not require being logged in, you can use urlretrieve():

from urllib.request import urlretrieve

for link in link_list:
    urlretrieve(link)
alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
  • 1
    Thanks for this. Even if it doesn't help the OP, I learned about a new Python function. :) – nivix zixer Apr 15 '15 at 04:46
  • Awesome. This works perfectly! One follow-up question though; how do I choose which directory to save the files in? – freddiev4 Apr 15 '15 at 05:15
  • 1
    @FreddieV4 you can specify the full path to the file in the second argument, see examples at http://stackoverflow.com/questions/6373094/how-to-download-a-file-to-a-specific-path-in-the-server-python. Thanks. – alecxe Apr 15 '15 at 12:00
  • > If this is something that does not require being logged in .... what changes if it does? – baxx Dec 01 '15 at 00:08