
I'm having a problem looping through URLs from a .txt file in order to get the titles of the PDFs. When there is only one URL, the code runs with no problems, but when there are more it throws the following error: `raise utils.PdfReadError("Could not read malformed PDF file") PyPDF2.utils.PdfReadError: Could not read malformed PDF file`.

As for the text file, there is one URL per line, no commas, no weird formatting.

Any idea why this could be happening? (Apologies if my question is not well formatted, it's actually my first one.) :)

import io
import requests
from bs4 import BeautifulSoup
from PyPDF2 import PdfFileReader

def extract_info_from_pdf_url():
    
    with open('pdfs.txt') as urls:
        for url in urls:
            r = requests.get(url)
            f = io.BytesIO(r.content)
            reader = PdfFileReader(f)
            title = reader.getDocumentInfo().title
            print(url)
            print(title)


extract_info_from_pdf_url()


  • Did you check if the PDF file that gets downloaded is actually a good PDF file? – Jongware Jul 30 '20 at 10:47
  • Yes, they are fine. I've tested with a single URL and it runs fine. Then for testing purposes I've just added the exact same URL twice (so basically the list is comprised of 2 exact same URLs) and it fails. – Praxitelis Nikolaou Jul 30 '20 at 11:04
  • Good! I haven't used requests that much, is it possible it's asynchronous and you may have to need to wait before a complete file is downloaded? That could explain why one file works but two in quick succession don't. – Jongware Jul 30 '20 at 11:08
  • Sounds like a good call! I will give that a try and let you know how it goes. Much appreciated! – Praxitelis Nikolaou Jul 30 '20 at 11:25
  • Do check this: https://stackoverflow.com/q/60364827/2564301 – Jongware Jul 30 '20 at 11:57
  • Thanks for this! I actually tried adding the urls in a list and it worked! – Praxitelis Nikolaou Jul 30 '20 at 12:54
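For future readers: a likely culprit, consistent with the comment thread (it worked once the URLs were put in a Python list instead of read from the file), is that iterating over a file yields each line *with* its trailing newline, so `requests.get()` is called with a URL ending in `\n` and the response body is not a valid PDF. A minimal sketch of the diagnosis, using a hypothetical in-memory file in place of `pdfs.txt`:

```python
import io

# Simulated pdfs.txt: one URL per line. Iterating a file object
# yields each line including its trailing "\n".
urls_file = io.StringIO("https://example.com/a.pdf\nhttps://example.com/b.pdf\n")

raw_lines = list(urls_file)
print(raw_lines)  # note the trailing "\n" on every entry

# Stripping whitespace from each line (and skipping blanks)
# restores clean URLs, matching the list-based workaround.
clean = [line.strip() for line in raw_lines if line.strip()]
print(clean)
```

So in the original loop, `r = requests.get(url.strip())` would be the one-line fix.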

0 Answers