I have code that extracts text from a PDF, but I also want to extract the text contained in images inside the PDF. I want to preserve the relative order of the regular text and the image text when processing the file. Here is my code for extracting the regular text:

import sys
import PyPDF2

def convertPdfToText(self, outputTextFile):
    try:
        # fileToConvert is defined elsewhere in the original program
        with open(fileToConvert, 'rb') as pdf_file, open(outputTextFile, 'w') as text_file:
            read_pdf = PyPDF2.PdfFileReader(pdf_file)
            number_of_pages = read_pdf.getNumPages()
            for page_number in range(number_of_pages):
                page = read_pdf.getPage(page_number)
                page_content = page.extractText()
                text_file.write(page_content)
    except:
        sys.exit("An error occurred.")
darknight
  • You have found a nice way to make all the errors undebuggable. – Klaus D. Nov 19 '19 at 14:05
  • You'll probably have to use OCR to identify the text from the image, since that isn't a native method in PyPDF, but first you'll have to extract the images (maybe https://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python can help with that). – Marc Sances Nov 19 '19 at 14:08
  • Extracting the images would help, but I think it would change the order. Do I have to convert every PDF page to an image? I don't want that; I want to process text as text and images as images. – darknight Nov 19 '19 at 14:14
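
One way to keep the order without rasterizing whole pages could look like the sketch below. It is not the asker's method and it does not use PyPDF2: it assumes PyMuPDF (`fitz`), Pillow, and pytesseract (plus the Tesseract binary) are installed. The names `merge_page_blocks` and `pdf_to_text_with_ocr` are hypothetical. The idea is to read each page as positioned blocks, OCR only the image blocks, and sort everything by the top-left corner of its bounding box. That sort is a simple heuristic and will likely misorder multi-column layouts.

```python
import io
from typing import Callable, Iterable, List, Tuple

# One page element: (top y, left x, kind, payload).
# kind is "text" (payload: str) or "image" (payload: raw image bytes).
Block = Tuple[float, float, str, object]


def merge_page_blocks(blocks: Iterable[Block], ocr: Callable[[bytes], str]) -> str:
    """Join text and OCR'd image text in top-to-bottom, left-to-right order."""
    parts: List[str] = []
    # Sort on position only, so payloads of different types are never compared.
    for _y, _x, kind, payload in sorted(blocks, key=lambda b: (b[0], b[1])):
        parts.append(payload if kind == "text" else ocr(payload))
    return "\n".join(p.strip() for p in parts if p.strip())


def pdf_to_text_with_ocr(path: str) -> str:
    """Hypothetical driver; assumes PyMuPDF, Pillow and pytesseract are available."""
    # Third-party imports are local so the pure helper above works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    def ocr(image_bytes: bytes) -> str:
        return pytesseract.image_to_string(Image.open(io.BytesIO(image_bytes)))

    pages: List[str] = []
    doc = fitz.open(path)
    try:
        for page in doc:
            blocks: List[Block] = []
            for b in page.get_text("dict")["blocks"]:
                x0, y0, _x1, _y1 = b["bbox"]
                if b["type"] == 0:  # text block: join the spans of each line
                    text = "\n".join(
                        "".join(span["text"] for span in line["spans"])
                        for line in b["lines"]
                    )
                    blocks.append((y0, x0, "text", text))
                elif b["type"] == 1:  # image block: keep the bytes for OCR
                    blocks.append((y0, x0, "image", b["image"]))
            pages.append(merge_page_blocks(blocks, ocr))
    finally:
        doc.close()
    return "\n".join(pages)
```

Because the ordering logic lives in `merge_page_blocks`, it can be tried with a stand-in OCR function before Tesseract is set up, and the position-based sort can later be swapped for something layout-aware if the documents need it.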

0 Answers