
I have a conversion script that converts PDF files and image files to text files, but it takes forever to run. It took almost 48 hours to finish 2,000 PDF documents. I now have a pool of more than 12,000 documents to convert, and at my previous rate I can't imagine how long the conversion will take. Is there anything I can change in my code to make it run faster?

Here is the code that I used.


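import glob
import os
import shutil

import pytesseract
from pdf2image import convert_from_path
from PIL import Image

def tesseractOCR_pdf(pdf):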

    filePath = pdf
    
    pages = convert_from_path(filePath, 500)

    # Counter to store images of each page of PDF to image 
    image_counter = 1

    # Iterate through all the pages stored above 
    for page in pages:
        # Declaring filename for each page of PDF as JPG 
        # For each page, filename will be: 
        # PDF page 1 -> page_1.jpg 
        # PDF page 2 -> page_2.jpg 
        # PDF page 3 -> page_3.jpg 
        # .... 
        # PDF page n -> page_n.jpg 

        filename = "page_"+str(image_counter)+".jpg"
        
        # Save the image of the page in system 
        page.save(filename, 'JPEG') 
        # Increment the counter to update filename 
        image_counter = image_counter + 1

    # Variable to get count of total number of pages 
    filelimit = image_counter-1


    # Create an empty string for storing purposes
    text = ""
    # Iterate from 1 to total number of pages 
    for i in range(1, filelimit + 1): 
        # Set filename to recognize text from 
        # Again, these files will be: 
        # page_1.jpg 
        # page_2.jpg 
        # .... 
        # page_n.jpg 
        filename = "page_"+str(i)+".jpg"

        # Recognize the text as string in image using pytesseract 
        text += str(((pytesseract.image_to_string(Image.open(filename))))) 

        text = text.replace('-\n', '')     

    
    #Delete all the jpg files that created from above
    for i in glob.glob("*.jpg"):
        os.remove(i)
        
    return text

def tesseractOCR_img(img):

    filePath = img
    
    text = str(pytesseract.image_to_string(filePath,lang='eng',config='--psm 6'))
    
    text = text.replace('-\n', '')
    
    return text

def Tesseract_ALL(docDir, txtDir, troubleDir):
    if docDir == "": docDir = os.getcwd() + "\\" #if no docDir passed in 
        
    for doc in os.listdir(docDir): #iterate through docs in doc directory
        try:
            fileExtension = doc.split(".")[-1]
            
            if fileExtension == "pdf":
                pdfFilename = docDir + doc 
                text = tesseractOCR_pdf(pdfFilename) #get string of text content of pdf
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w") #make text file
                textFile.write(text) #write text to text file
            else:   
#             elif (fileExtension == "tif") | (fileExtension == "tiff") | (fileExtension == "jpg"):
                imgFilename = docDir + doc 
                text = tesseractOCR_img(imgFilename) #get string of text content of img
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w") #make text file
                textFile.write(text) #write text to text file
        except:
            print("Error in file: "+ str(doc))
            shutil.move(os.path.join(docDir, doc), troubleDir)
            
    for filename in os.listdir(txtDir):
        fileExtension = filename.split(".")[-2]
        if fileExtension == "pdf":
            os.rename(txtDir + filename, txtDir + filename.replace('.pdf', ''))
        elif fileExtension == "tif":
            os.rename(txtDir + filename, txtDir + filename.replace('.tif', ''))
        elif fileExtension == "tiff":
            os.rename(txtDir + filename, txtDir + filename.replace('.tiff', ''))
        elif fileExtension == "jpg":
            os.rename(txtDir + filename, txtDir + filename.replace('.jpg', ''))
docDir = "/drive/codingstark/Project/pdf/"
txtDir = "/drive/codingstark/Project/txt/"
troubleDir = "/drive/codingstark/Project/trouble_pdf/"

Tesseract_ALL(docDir, txtDir, troubleDir)

Does anyone know how I can edit my code to make it run faster?

DataWizard
  • Have you considered parallelization? Also, if you haven't yet, time sections of your code to find the areas that take the most time. For the PDF function, I feel there should be a better way to pass the page images along without first saving them to disk. – Tom Myddeltyn Nov 26 '20 at 23:36
  • @TomMyddeltyn Thank you for your reply! What is parallelization, and how can I time sections of my code? – DataWizard Nov 26 '20 at 23:38
  • A very simple way is to call `time.time()` and store the result before a section, then call it again afterwards and subtract, which gives the elapsed seconds. It would be a rough estimate. As for parallelization: instead of serially processing each file one at a time, run a bunch of them "simultaneously". With all the disk writes involved here, I think it should speed up your process. In general, if you can reduce the file accesses and keep the data in main memory, it should be faster. – Tom Myddeltyn Nov 26 '20 at 23:44
  • Hey @TomMyddeltyn, just a side note. I think it would be really difficult to edit your code without at least some sample files. Check out this [Help others reproduce the problem](https://stackoverflow.com/help/minimal-reproducible-example) guide. – Arthur Harduim Nov 26 '20 at 23:48
  • @TomMyddeltyn How can I reduce the file accesses and keep the data in main memory? – DataWizard Nov 27 '20 at 01:10
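
For anyone wondering how the timing suggested in the comments could look, here is a minimal sketch. It uses `time.perf_counter()` instead of `time.time()` (my substitution; both work for rough estimates), and the helper name `timed` and the label `"sleep"` are made up for illustration. The `time.sleep` call is only a stand-in for a slow step such as rendering a PDF or running OCR.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Add the wall-clock seconds spent inside the `with` block to results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[label] = results.get(label, 0.0) + elapsed

timings = {}
with timed("sleep", timings):
    time.sleep(0.05)  # stand-in for a slow step (hypothetical workload)

# Print the accumulated seconds per label, slowest first
for label, seconds in sorted(timings.items(), key=lambda item: -item[1]):
    print(f"{label}: {seconds:.3f} s")
```

Wrapping the PDF-to-image step and the OCR step in separate `with timed(...)` blocks should show which of the two dominates the run time.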

2 Answers

1

I think a process pool would be perfect for your case.

First you need to figure out which parts of your code can run independently of each other, then wrap them in a function.

Here is an example

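from concurrent.futures import ProcessPoolExecutor

def do_some_OCR(filename):
    pass  # placeholder: run the OCR for one file here

file_list = []  # placeholder: fill in the paths of the documents to convert

# The guard is needed on platforms that start worker processes by re-importing this module
if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        for file in file_list:
            _ = executor.submit(do_some_OCR, file)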

The code above submits each file to a pool of worker processes (by default, as many as the machine has processors) and processes the files in parallel.

You can find the official documentation here: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor

There is also a really awesome video that shows step-by-step how to use processes for exactly this: https://www.youtube.com/watch?v=fKl2JW_qrso
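
To make the process-pool idea a bit more concrete, here is a sketch that also collects the results and records which files failed, similar in spirit to the trouble directory in the question. The function name `convert_all` and its parameters are hypothetical, not part of any library, and I have not run it against real OCR workloads. The `executor_cls` parameter is there only so that a different executor class can be swapped in.

```python
import concurrent.futures

def convert_all(paths, worker, executor_cls=concurrent.futures.ProcessPoolExecutor,
                max_workers=None):
    """Run worker(path) for every path in a pool and return (results, failures)."""
    results, failures = {}, {}
    with executor_cls(max_workers=max_workers) as executor:
        # Map each future back to the path it was created for
        futures = {executor.submit(worker, path): path for path in paths}
        for future in concurrent.futures.as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()  # re-raises the worker's exception, if any
            except Exception as exc:
                failures[path] = exc  # keep going; the caller decides what to do
    return results, failures
```

With a process pool, `worker` has to be a top-level function (so it can be pickled), and the call belongs under an `if __name__ == "__main__":` guard. Assuming the question's function is used as the worker, the call would look like `results, failures = convert_all(pdf_paths, tesseractOCR_pdf)`, where `pdf_paths` is a list of file paths you build yourself.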

0

Here is a compact version of the function removing the file write stuff. I think this should work based on what I was reading on the APIs but I haven't tested this.

Note that I changed from a string to a list because appending to a list is MUCH less costly than appending to a string (see "How slow is Python's string concatenation vs. str.join?" for join vs. concatenation). TL;DR: string concatenation makes a new string every time you concatenate, so with large strings you end up copying the data many times.

Also, calling replace on the whole string after each concatenation was again creating a new string every iteration, so I moved it to operate on each page's string as it is generated. Note that if for some reason the string '-\n' is an artifact that previously occurred because of the concatenation, then the replace should be removed from where it is and placed here: return ''.join(pageText).replace('-\n',''). Realize, though, that putting it there creates one new string with the join and then a whole new string with the replace.

def tesseractOCR_pdf(pdf):
 
    # Render each page of the PDF as a PIL Image at 500 DPI
    pages = convert_from_path(pdf, 500)

    # Create an empty list to collect the text of each page
    pageText = []
    # Iterate through the pages; each one is a PIL Image held in memory
    for page in pages:
        # Recognize the text in the image using pytesseract 
        # Add the text to the list while removing the '-\n' sequences
        pageText.append(str(pytesseract.image_to_string(page)).replace('-\n',''))

    return ''.join(pageText)

An even more compact one-liner version

def tesseractOCR_pdf(pdf):
    #This takes each page of the pdf, extracts the text, removing -\n and combines the text.
    return ''.join([str(pytesseract.image_to_string(page)).replace('-\n', '') for page in convert_from_path(pdf, 500)])
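
The note above about where to put the replace can be checked without running any OCR. In this small sketch the two helper names and the sample strings are invented for illustration and are not real Tesseract output. The two placements agree when '-\n' sits inside one page's text, and differ only when the hyphen ends one page's string and the newline starts the next.

```python
def join_then_replace(page_texts):
    # Join all pages first, then remove '-\n' from the combined string
    return "".join(page_texts).replace("-\n", "")

def replace_then_join(page_texts):
    # Remove '-\n' from each page's text, then join the cleaned pieces
    return "".join(text.replace("-\n", "") for text in page_texts)

# Made-up samples: a hyphen break inside one page, and one split across two pages
within_page = ["exam-\nple text\n", "next page\n"]
across_pages = ["a word split by hyphen-", "\nation across pages\n"]

print(join_then_replace(within_page) == replace_then_join(within_page))
print(join_then_replace(across_pages) == replace_then_join(across_pages))
```

If the second case never shows up in your documents, the per-page version is enough.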
Tom Myddeltyn