
I am using pytesseract in the code below:

    import numpy as np
    import PIL.Image
    import pytesseract

    def fnd(image_list):
        for fname in image_list:
            # Load the image into a numpy array
            x = np.array([np.array(PIL.Image.open(fname))])
            print(x.size)
            for im in x:
                # OCR the image and append the text to Output.txt
                txt = pytesseract.image_to_string(im).strip()
                with open("Output.txt", "a+", encoding="utf-8") as outfile:
                    outfile.write(txt)
                # Scan the accumulated output for the keyword
                with open("Output.txt", encoding="utf-8") as openfile:
                    for line in openfile:
                        for part in line.split():
                            if "cyber" in part.lower():
                                print(line)
                                return

The list passed to `fnd` contains the names of images from a folder (2408×3506 pixels, 300 dpi, grayscale). Unfortunately, for around 35 images the total processing time is around 1400-1500 seconds.

Is there a way I can reduce the processing time?

Mikku
  • Maybe you could try multithreading, [here](https://www.geeksforgeeks.org/multithreading-python-set-1/). I think your images are being processed sequentially (one after the other), so running them in parallel might save some time. – Vineeth Sai Aug 29 '18 at 06:16
  • I can try that... But @VineethSai, even if I put a single image in this code, it takes around 30-50 seconds to process. – Mikku Aug 29 '18 at 06:19
  • You are trying to find characters in an image, right? Maybe you can use the PIL library to reduce the resolution of the image before you pass it into tesseract. Maybe such a huge resolution (_2408×3506, 300 dpi, grayscale_) is not necessary for tesseract to recognize characters. – Vineeth Sai Aug 29 '18 at 06:22
  • So reducing the resolution combined with parallel processing might reduce the time a lot (see the sketch after this thread). – Vineeth Sai Aug 29 '18 at 06:25
  • @VineethSai ... That was the first thought in my mind. I tried with resolution of 100 & 200 as well, but the quality of output was compromised in that case. I was able to achieve the correct output only with resolution 300. But yeah changing the color to Grayscale reduced time & improved accuracy. – Mikku Aug 29 '18 at 06:25
  • Maybe try resizing the image to a lower resolution? There are more than 8 million pixel values in each image you're using, and running Tesseract on such a high density of pixels will slow things down. The image will look bad to you, but tesseract shouldn't have a problem with it. – Vineeth Sai Aug 29 '18 at 06:32
  • @VineethSai I was able to cut the processing time in half after scaling the images to 50% of their dimensions and 200 dpi, but anything lower than that produces vague results. It still takes too much time, though. – Mikku Aug 29 '18 at 07:09
  • I think the only place where we can improve efficiency is in the image pre-processing, that is, providing only the necessary data to the tesseract OCR. Can you share one image? I'll try it on my machine to find the optimal adjustments. – Vineeth Sai Aug 29 '18 at 07:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/178999/discussion-between-mohit-and-vineeth-sai). – Mikku Aug 29 '18 at 07:46
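
A minimal sketch of the downscale-then-parallelise idea from the comments, assuming pytesseract and Pillow are installed; the 50% scale factor is the one Mikku reported as still giving correct output, and a `multiprocessing.Pool` is used here as one way to run the images in parallel:

    from multiprocessing import Pool

    import PIL.Image
    import pytesseract

    def ocr_one(fname):
        # Downscale to 50% before OCR to cut the pixel count Tesseract has to scan
        img = PIL.Image.open(fname)
        img = img.resize((img.width // 2, img.height // 2), PIL.Image.LANCZOS)
        return pytesseract.image_to_string(img)

    if __name__ == "__main__":
        image_list = [...]  # same list of image file names as in the question
        with Pool() as pool:
            texts = pool.map(ocr_one, image_list)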

1 Answer


Pytesseract writes each image you pass it to a temporary file, invokes the tesseract binary on it, and reads the result back. That per-image overhead is unnecessary when running 35 images. Instead, you should use a Python wrapper around the Tesseract API, which keeps the engine loaded between images. This will be significantly faster.
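
A minimal sketch using the tesserocr package (an assumption; any wrapper around the Tesseract C API would do), which initialises the engine once and reuses it for every file instead of spawning a tesseract process per image. The `keyword` parameter mirrors the "cyber" search from the question:

    from tesserocr import PyTessBaseAPI

    def fnd(image_list, keyword="cyber"):
        # Initialise the Tesseract engine once and reuse it for all images
        with PyTessBaseAPI() as api:
            for fname in image_list:
                api.SetImageFile(fname)
                txt = api.GetUTF8Text()
                with open("Output.txt", "a+", encoding="utf-8") as outfile:
                    outfile.write(txt)
                if keyword in txt.lower():
                    print(fname)
                    return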