I want to convert web PDFs, such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf and many more, into text without saving them to my PC. Thousands of such announcements come out daily, so storing them locally is not feasible. Is there a Python solution for this? Thanks.

1 Answer

There are several ways to do this, but the simplest is to download the PDF locally and then extract the text with a Python module.

Here is a simple code example using pdfplumber:

from urllib.request import urlopen
import pdfplumber

url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'

# Download the PDF and save it locally (overwrites any previous img.pdf)
with urlopen(url) as response, open("img.pdf", "wb") as file:
    file.write(response.read())

try:
    pdf = pdfplumber.open("img.pdf")
except Exception:
    # Some files are not PDFs (annexes we don't want), or the PDF may be damaged
    print("Error. Are you sure this is a PDF?")
else:
    # pdfplumber text extraction from the first page
    with pdf:
        page = pdf.pages[0]
        text = page.extract_text()

EDIT: My bad, I just realised you asked for "without saving it to my PC". That said, I also scrape a lot of PDFs (thousands as well), but I save them all as "img.pdf", so each one overwrites the previous one and I end up with only one PDF file on disk. I don't have a solution for extracting PDF text without saving the file. Sorry about that :'(

Jules Civel
  • But that's exactly what I don't want to do, sir. I have tried it that way, but it isn't feasible as there are thousands of such PDFs daily, so storing them on my PC is not practical. Is there any way to do it all within the Python code? – Jay shankarpure Jan 26 '22 at 12:50
  • Yes, sorry, I understood that a bit late. But if your problem is disk space, saving them under one generic name so that each file replaces the last works well. If your problem is processing time, then I'll leave it to someone else; I'm really interested in the answer, as I tried to do this without success. – Jules Civel Jan 26 '22 at 12:55
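
For what it's worth, the "without saving" variant discussed in the comments can probably be sketched by keeping the downloaded bytes in memory and handing pdfplumber a file-like object instead of a path. This is only a rough sketch, not something tested against the NSE archive: the helper names (looks_like_pdf, pdf_bytes_to_text, pdf_url_to_text) and the 30-second timeout are made up for illustration, and the "%PDF-" prefix check is just a cheap sanity test, not real validation.

```python
import io
from urllib.request import urlopen


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files normally start with the '%PDF-' signature."""
    return data.startswith(b"%PDF-")


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract text from PDF bytes held in memory; nothing is written to disk."""
    if not looks_like_pdf(data):
        raise ValueError("Data does not look like a PDF")
    import pdfplumber  # third-party (pip install pdfplumber), imported only when needed

    # io.BytesIO wraps the bytes in a file-like object that pdfplumber can read
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        # extract_text() may return None for pages without a text layer
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def pdf_url_to_text(url: str, timeout: float = 30.0) -> str:
    """Download a PDF into memory and return its text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return pdf_bytes_to_text(response.read())
```

Usage would be something like text = pdf_url_to_text(url) inside whatever loop walks the daily announcements, catching ValueError for the files that turn out not to be PDFs. Scanned PDFs with no text layer would still come back empty, since this reads embedded text rather than doing OCR.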