Create searchable (multipage) PDF with Python

Question

I've found some guides online on how to make a PDF searchable if it was scanned. However, I'm currently struggling with figuring out how to do it for a multipage PDF.

My code takes a multipage PDF, converts each page into a JPG, runs OCR on each page, and then converts the result into a PDF. However, the output contains only the last page.

import pytesseract
from pdf2image import convert_from_path

pytesseract.pytesseract.tesseract_cmd = 'directory'
TESSDATA_PREFIX = 'directory'
tessdata_dir_config = '--tessdata-dir directory'

# Path of the pdf
PDF_file = r"pdf directory"
  
  
def pdf_text():
    
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(PDF_file, 500)
  
    image_counter = 1

    for page in pages:

        # Declare file names
        filename = "page_"+str(image_counter)+".jpg"

        # Save the image of the page in system
        page.save(filename, 'JPEG')

        # Increment the counter to update filename
        image_counter = image_counter + 1

    # Variable to get count of total number of pages
    filelimit = image_counter-1

    outfile = "out_text.pdf"

    # Open the file in append mode so that all contents of all images are added to the same file
    
    f = open(outfile, "a")

    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):

        filename = "page_"+str(i)+".jpg"

        # Recognize the text as string in image using pytesseract
        result =  pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config) 

            
        f = open(outfile, "w+b")
        f.write(bytearray(result))
        f.close()

pdf_text()

How can I run this for all pages and output one merged PDF?

why do you use `f = open(outfile, "w+b")` ? You already opened it before `for`-loop for appending `f = open(outfile, "a")` and you shouldn't open it again and again. And you should close it after `for`-loop, not inside — furas, Aug 16 '21 at 12:51

furas · Accepted Answer · 2021-08-16T13:30:47.893

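I can't run it, but I think the whole problem is that you call open(..., 'w+b') inside the loop. That truncates the file on every iteration, removing the previous content, so in the end only the last page is written.

You should open the file once before the loop, in binary append mode ("ab" rather than "a", because the OCR result is bytes), and close it after the loop.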

# --- before loop ---

f = open(outfile, "ab")

# --- loop ---

for i in range(1, filelimit+1):

    filename = f"page_{i}.jpg"

    result =  pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config) 

    f.write(bytearray(result))

# --- after loop ---
        
f.close()
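To see the effect of the file mode on its own, here is a small sketch without any OCR. The byte strings in `chunks` and the file names are placeholders made up for this demo; they stand in for the per-page results.

```python
import os
import tempfile

# Placeholder data standing in for the per-page OCR results
chunks = [b"page 1\n", b"page 2\n", b"page 3\n"]

with tempfile.TemporaryDirectory() as tmp:
    # Reopening with "w+b" on every iteration truncates the file each time
    truncated = os.path.join(tmp, "truncated.bin")
    for chunk in chunks:
        with open(truncated, "w+b") as f:
            f.write(chunk)

    # Opening once before the loop keeps everything that was written
    kept = os.path.join(tmp, "kept.bin")
    with open(kept, "ab") as f:
        for chunk in chunks:
            f.write(chunk)

    with open(truncated, "rb") as f:
        truncated_content = f.read()
    with open(kept, "rb") as f:
        kept_content = f.read()

print(truncated_content)  # only the last chunk survives
print(kept_content)       # all three chunks, in order
```

The first file ends up holding only the last chunk, which matches the "only the last page" symptom from the question.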

BTW:

There is another problem, though: image_to_pdf_or_hocr creates a complete PDF on every call, with its own header and trailing structures, so appending two results to the same file can't produce a valid PDF. You would have to use a dedicated module to merge the PDFs; see, for example, the question "Merge PDF files".

Something similar to this:

    # --- before loop ---
    
    # PyPDF2 3.0 removed PdfFileMerger; on newer versions use PdfMerger instead
    from PyPDF2 import PdfFileMerger
    import io

    merger = PdfFileMerger()

    # --- loop ---
    
    for i in range(1, filelimit + 1):

        filename = "page_"+str(i)+".jpg"

        result =  pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config)
        
        pdf_file_in_memory = io.BytesIO(result)        
        merger.append(pdf_file_in_memory)
        
    # --- after loop ---
    
    merger.write(outfile)
    merger.close()
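The merge loop can also be pulled into a small helper. This is only a rough sketch: `merge_ocr_pages`, `ocr` and `merger` are names invented for this example. `ocr` is assumed to be any callable that returns the PDF bytes for one page, and `merger` any object with `append()`, `write()` and `close()`, such as the `PdfFileMerger` instance above.

```python
import io
from typing import Callable, Iterable


def merge_ocr_pages(pages: Iterable, ocr: Callable[[object], bytes],
                    merger, outfile: str) -> int:
    """OCR each page and append the resulting one-page PDF to `merger`.

    Returns the number of pages that were merged.
    """
    count = 0
    try:
        for page in pages:
            # Wrap the bytes in a file-like object; no temporary file is needed
            merger.append(io.BytesIO(ocr(page)))
            count += 1
        merger.write(outfile)
    finally:
        # Close the merger even if OCR or writing fails part-way through
        merger.close()
    return count
```

With the libraries from the question, the call would look roughly like `merge_ocr_pages(convert_from_path(PDF_file, 500), lambda img: pytesseract.image_to_pdf_or_hocr(img, lang="eng"), PdfFileMerger(), "out_text.pdf")`. As far as I know pytesseract accepts PIL images directly, so the intermediate JPG files could be skipped, but I have not run this against real scans.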

Used the `PdfFileMerger` and that solved my issue :-) thanks! — Artem, Aug 16 '21 at 13:22

score 1 · Answer 2 · answered Aug 16 '21 at 11:00

There are a number of potential issues here, and without being able to debug it, it's hard to say what the root cause is.

Are the JPGs being successfully created, and as separate files as expected?

I would suspect that pages = convert_from_path(PDF_file, 500) is not returning what you expect. Have you manually verified that the page images are being created correctly?
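One way to check this without opening each file by hand is a small helper like the one below. It is just a sketch: `missing_page_images` is a made-up name, and it assumes the `page_<n>.jpg` naming used in the question.

```python
from pathlib import Path


def missing_page_images(page_count, directory="."):
    """Return the expected page_<n>.jpg names that are missing or empty."""
    missing = []
    for i in range(1, page_count + 1):
        path = Path(directory) / f"page_{i}.jpg"
        # A zero-byte file counts as missing: the save probably failed
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing
```

Calling `missing_page_images(len(pages))` right after the first loop should return an empty list if every page was saved.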

Kyle Jones

Yes, the JPGs are created as expected, with one image for each page. I suspect the last loop, which OCRs the images and writes the bytearrays, but I haven't been able to fix it yet. – Artem Aug 16 '21 at 11:27
Might be `f = open(outfile, "w+b")`. This opens the file in write mode, but you'll probably want `a` for append. – Kyle Jones Aug 16 '21 at 13:50
