
I want to read a PDF and extract its text. The PDF is hosted at a URL, and I don't want to save it to disk; I'd like to read it directly from the internet. Is this possible?

I tried using Tika, but it doesn't work. It gave me this error:

2019-08-29 15:39:15,416 [MainThread ] [WARNI] Tika server returned status: 500 {'status': 500}

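from tika import parser

# URL of the remote PDF (elided in the original post)
URL_path = "http://www.---path to .pdf"
# from_file returns a dict that includes "metadata" and "content" keys
raw = parser.from_file(URL_path)
print(raw)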
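
One possible way to avoid saving anything to disk is to fetch the PDF bytes into memory and hand them to Tika as a buffer. The following is only a minimal sketch under stated assumptions: `fetch_pdf_bytes` and `extract_text` are hypothetical helper names, the `opener` parameter exists purely so the download step can be swapped out, and `extract_text` assumes tika-python and a working Java runtime are available. It does not address whatever caused the 500 above.

```python
from urllib.request import urlopen


def fetch_pdf_bytes(url, opener=urlopen, timeout=30):
    """Download the resource at `url` into memory and return its raw bytes."""
    with opener(url, timeout=timeout) as response:
        data = response.read()
    # PDF files normally begin with the "%PDF-" header; anything else is
    # probably an HTML error page rather than the document itself.
    if not data.startswith(b"%PDF-"):
        raise ValueError(f"{url} did not return a PDF")
    return data


def extract_text(pdf_bytes):
    """Send in-memory PDF bytes to Tika and return the extracted text."""
    # Imported lazily so the download helper works without Tika installed
    # (assumption: tika-python and Java are available when this is called).
    from tika import parser

    parsed = parser.from_buffer(pdf_bytes)
    # "content" can be None when Tika extracts no text (e.g. scanned pages).
    return parsed.get("content") or ""
```

For many URLs, the two helpers could be called in a loop, for example `extract_text(fetch_pdf_bytes(url))`, with a `try`/`except` around each call so that one bad link does not stop the run.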
developer
  • So you want to read the content of a PDF file from an online link, and you are writing a Python script for that? – shuberman Aug 29 '19 at 10:13
  • The "500" suggests that there is a problem with your request. Is the URL definitely correct? The server expects a GET request? If you copy the exact URL from your code and paste it in an incognito window in your browser, does it work? – saintamh Aug 29 '19 at 10:16
  • @saintamh, thanks for your reply. Yes, the URL is definitely right. I tried opening it in incognito mode and it opens a PDF. – developer Aug 29 '19 at 10:17
  • @mishsx, thanks for the reply. Yes, I'm trying to write a script that reads an online PDF and extracts the text from it. – developer Aug 29 '19 at 10:18
  • But why? Unless you are doing this for a large number of URLs, it makes more sense to download the file and scan it using an OCR library. – shuberman Aug 29 '19 at 10:20
  • @mishsx, you are right. I am doing it for more than 100 URLs. – developer Aug 29 '19 at 10:24
  • Possible duplicate of [How can i grab pdf links from website with Python script](https://stackoverflow.com/questions/6222911/how-can-i-grab-pdf-links-from-website-with-python-script) – shuberman Aug 29 '19 at 10:29
  • @mishsx, see, I don't need the PDF links. I need the PDF data, which can be extracted to a text file. – developer Aug 29 '19 at 10:30
  • Well, that's one part of the problem, but the other part is already answered here: https://stackoverflow.com/a/45480440/7841468 – shuberman Aug 29 '19 at 10:31

0 Answers