Hopefully this will be an easy one. I am trying to do some web scraping where I download all the PDF files from a page. Currently I am scraping files from a sports page for practice. I used Automate the Boring Stuff plus a post from another user (retrieve links from web page using python and BeautifulSoup) to come up with this code.

import os
import requests
import time
from bs4 import BeautifulSoup, SoupStrainer

r = requests.get('http://secsports.go.com/media/baseball')

soup = BeautifulSoup(r.content)

for link in BeautifulSoup(r.text, parseOnlyThese=SoupStrainer('a')):
    if link.has_attr('href'):
        if 'pdf' in str(link):
            image_file = open(os.path.join('E:\\thisiswhereiwantmypdfstogo', os.path.basename(link['href'])), 'wb')
            for chunk in r.iter_content(100000):
                image_file.write(chunk)
                image_file.close()

The files that are output to the directory I specify are all there, which is great, but the file size is the same for all of them, and when I open Adobe Pro to look at them I get an error that says:

"Adobe Acrobat could not open "FILENAMEHERE" because it is either not a supported filetype or because the file has been damaged (for example, it was sent as an email attachment and wasn't correctly decoded)."

A little hint that clued me in that something was going wrong with the write process was that running image_file.write(chunk) outputs the same number for every file.

Here is what the PDFs look like in the folder (screenshot: the_corrupted_pdfs):

I am thinking I just need to add a parameter somewhere during the writing process for it to work correctly, but I have no idea what it would be. I did some Google searching for an answer and also searched a bit on here but cannot find the answer.

Thanks!

  • Maybe the function `urlretrieve` from `urllib.request` can help to download the PDF (see the sketch after these comments). Edit: see https://docs.python.org/3/library/urllib.request.html#legacy-interface if you need more information about the function. – pwnsauce May 03 '17 at 09:24
  • There are multiple examples here: http://stackoverflow.com/questions/7243750/download-file-from-web-in-python-3 – pwnsauce May 03 '17 at 09:32
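
For reference, a minimal sketch of the `urlretrieve` approach suggested in the first comment. The URL below is a hypothetical placeholder for a direct PDF link; the target directory is the one from the question.

import os
from urllib.request import urlretrieve

pdf_url = 'http://example.com/some_report.pdf'  # hypothetical direct link to a PDF
save_path = os.path.join('E:\\thisiswhereiwantmypdfstogo', os.path.basename(pdf_url))

# urlretrieve fetches the resource at pdf_url and writes it to save_path.
urlretrieve(pdf_url, save_path)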

1 Answer


Hmmm. After doing some more research it seems like I figured out the problem. I do not understand exactly why this works, but I'll take a stab at it. I modified my code so that each link['href'] becomes its own response object (by requesting it directly), then wrote those responses to my directory, and it worked.
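
Roughly, that change might look like the sketch below. This is a best guess at the modified code rather than the exact version: the directory path and chunk size are carried over from the question, and `urljoin` is an added assumption in case some hrefs on the page are relative.

import os
import requests
from bs4 import BeautifulSoup, SoupStrainer
from urllib.parse import urljoin

base_url = 'http://secsports.go.com/media/baseball'
save_dir = 'E:\\thisiswhereiwantmypdfstogo'

page = requests.get(base_url)
soup = BeautifulSoup(page.text, 'html.parser', parse_only=SoupStrainer('a'))

for link in soup.find_all('a', href=True):
    if 'pdf' in link['href']:
        # The key change: request each PDF link itself instead of reusing
        # the response object for the HTML page.
        pdf_url = urljoin(base_url, link['href'])  # resolves relative hrefs
        pdf_response = requests.get(pdf_url, stream=True)
        out_path = os.path.join(save_dir, os.path.basename(link['href']))
        with open(out_path, 'wb') as pdf_file:
            for chunk in pdf_response.iter_content(100000):
                pdf_file.write(chunk)

The original code wrote r.iter_content(100000) for every file, so each "PDF" was just a copy of the HTML page's bytes, which is why every file had the same size and Adobe could not open them.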

  • Could you show how you modified your code? I encountered the same problem. However, in my case, the first few links work totally fine, but after about 50-ish links have been downloaded, `requests.get(url, stream=True)` keeps corrupting whatever `pdf` files are downloaded. – IgNite May 12 '18 at 19:34