How to improve tesseract.js accuracy?

Question

I'm using this piece of code from the website, but it's not accurate enough:

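  // Assumes this runs inside an async function and that `image` is already defined.
  const { createWorker, createScheduler } = require("tesseract.js");

  const scheduler = createScheduler();
  const worker1 = createWorker();
  const worker2 = createWorker();

  await worker1.load();
  await worker2.load();
  await worker1.loadLanguage("eng");
  await worker2.loadLanguage("eng");
  await worker1.initialize("eng");
  await worker2.initialize("eng");

  // Register both workers so the scheduler can distribute jobs between them
  scheduler.addWorker(worker1);
  scheduler.addWorker(worker2);

  // Run a single recognition job on the input image
  const {
    data: { text }
  } = await scheduler.addJob("recognize", image);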

This is the type of image I'm trying to read text from:

Though it seems simple and easy, Tesseract sometimes fails to read it. Are there any better alternatives to tesseract.js, or any way to improve the accuracy?

Have you tried applying some filtering to the input images, for example to enhance the contrast or enlarge them? I think one way to get better accuracy is to modify the input images. – Kostas Minaidis, Dec 01 '19 at 13:53
Actually, I have applied some filters and removed some of the noise to make the image clearer, and the results improved, but it still fails to read the text sometimes. I don't know why. – PayamB., Dec 01 '19 at 13:57
You can start with this post: https://docparser.com/blog/improve-ocr-accuracy/ Increasing contrast, sharpening the image, and removing noise are some basic image enhancements that might help you get more accurate results. – Kostas Minaidis, Dec 01 '19 at 14:12
Additionally, you might want to check threshold filtering. See this code for example: https://github.com/laurenzcodes/Canvas-Threshold-Effect – Kostas Minaidis, Dec 01 '19 at 14:14
You can also dive deeper into edge detection algorithms, like the Sobel or Canny algorithms. – Kostas Minaidis, Dec 01 '19 at 14:20
I used a negative version of your image and it works fine. Additional gamma correction also looks promising. – Aikon Mogwai, Dec 01 '19 at 18:11
I am facing accuracy issues as well when piping in an HTML canvas with very basic black strokes on a white background. I am getting wildly inconsistent results even when just attempting to detect numbers. – Taylor A. Leach, Dec 06 '21 at 04:53
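
The inversion and gamma correction suggested in the comments above can be tried without any library. The following is only a minimal sketch, assuming the image has already been decoded into an 8-bit grayscale `Uint8Array`; the function names `invertGray` and `gammaCorrect` are hypothetical, and the gamma convention used here (`out = 255 * (in / 255) ^ gamma`) is one of several in common use.

```javascript
// Invert an 8-bit grayscale buffer: light text on a dark background
// becomes dark text on a light background.
function invertGray(gray) {
  return Uint8Array.from(gray, (v) => 255 - v);
}

// Apply gamma correction using a precomputed lookup table.
// With this convention, gamma < 1 brightens mid-tones and gamma > 1 darkens them.
function gammaCorrect(gray, gamma) {
  const lut = new Uint8Array(256);
  for (let i = 0; i < 256; i++) {
    lut[i] = Math.round(255 * Math.pow(i / 255, gamma));
  }
  return Uint8Array.from(gray, (v) => lut[v]);
}

// Usage: a tiny hypothetical 4-pixel "image".
const samplePixels = Uint8Array.from([0, 64, 128, 255]);
const negative = invertGray(samplePixels);
const adjusted = gammaCorrect(negative, 0.8);
console.log(Array.from(negative), Array.from(adjusted));
```

The processed buffer would still need to be encoded back into an image format (or drawn onto a canvas) before being passed to the OCR step.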

score 3 · Accepted Answer · edited Dec 19 '21 at 09:44

When applying OCR with Tesseract, it is important to preprocess the image so that the text to detect is black and the background is white. To do this, you can apply a simple threshold to obtain a binary image. Here's the image after preprocessing:

Result from Tesseract

I implemented this approach in Python with OpenCV, but you can adapt a similar strategy to JavaScript!

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image and Otsu's Threshold to get a binary image
image = cv2.imread('1.png', 0)
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Perform OCR
data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.waitKey()
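
For a JavaScript port, the same idea (Otsu's threshold followed by an inverse binary mapping) can be sketched in plain code. This is not the OpenCV implementation, only an illustration under stated assumptions: the input is an 8-bit grayscale `Uint8Array`, and the names `otsuThreshold` and `thresholdInverse` are hypothetical helpers.

```javascript
// Find the threshold that maximizes between-class variance (Otsu's method)
// for an 8-bit grayscale buffer.
function otsuThreshold(gray) {
  const hist = new Array(256).fill(0);
  for (const v of gray) hist[v]++;

  const total = gray.length;
  let sumAll = 0;
  for (let i = 0; i < 256; i++) sumAll += i * hist[i];

  let sumLow = 0;    // intensity sum of pixels at or below the candidate threshold
  let countLow = 0;  // number of pixels at or below the candidate threshold
  let bestVariance = -1;
  let best = 0;

  for (let t = 0; t < 256; t++) {
    countLow += hist[t];
    if (countLow === 0) continue;
    const countHigh = total - countLow;
    if (countHigh === 0) break;
    sumLow += t * hist[t];
    const meanLow = sumLow / countLow;
    const meanHigh = (sumAll - sumLow) / countHigh;
    const variance = countLow * countHigh * (meanLow - meanHigh) ** 2;
    if (variance > bestVariance) {
      bestVariance = variance;
      best = t;
    }
  }
  return best;
}

// Inverse binary threshold: pixels above t become 0 (black),
// all others become 255 (white).
function thresholdInverse(gray, t) {
  const out = new Uint8Array(gray.length);
  for (let i = 0; i < gray.length; i++) out[i] = gray[i] > t ? 0 : 255;
  return out;
}

// Usage: hypothetical buffer with a dark background and bright text pixels.
const demoPixels = Uint8Array.from([10, 12, 11, 200, 210, 205]);
const demoBinary = thresholdInverse(demoPixels, otsuThreshold(demoPixels));
console.log(Array.from(demoBinary));
```

With light text on a dark background, the inverse mapping yields black text on white, which is the form the answer recommends feeding to Tesseract.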

edited Dec 19 '21 at 09:44

MohamadKh75

answered Dec 03 '19 at 02:15

nathancy

Thanks for the answer. Do you know of any Node.js library to achieve that? – PayamB. Dec 03 '19 at 09:09
Using Jimp, I inverted the colors and the accuracy improved a lot. I think that's enough for my current project, but I still need a good library to do the rest in Node.js. Anyway, thanks for your answer. – PayamB. Dec 03 '19 at 10:35
Unfortunately, I'm not too familiar with node.js but once you find one you can follow the same approach. Good luck! – nathancy Dec 03 '19 at 20:38
Thanks for the hint regarding Jimp; I'm not sure why it shouldn't be possible to port it but I found something that looks similar and runs on Node.js: [Nimp](https://github.com/dan335/nimp) – gekkedev Mar 14 '21 at 17:04
I can recommend using the `sharp` npm library; it has all these features built in. – Peter Ferencz Sep 24 '22 at 11:54
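
Whichever library is used, thresholding and inversion operate on grayscale values, while raw pixel data (for example from a canvas `ImageData` object) is typically interleaved RGBA. A rough, dependency-free sketch of that conversion step follows; the function name `rgbaToGray` is hypothetical, and the luma weights are the common BT.601 ones, chosen here as an assumption rather than taken from the thread.

```javascript
// Convert interleaved RGBA bytes (4 bytes per pixel) into one
// 8-bit grayscale value per pixel. The alpha channel is ignored.
function rgbaToGray(rgba) {
  const gray = new Uint8Array(rgba.length / 4);
  for (let i = 0, j = 0; i + 3 < rgba.length; i += 4, j++) {
    const r = rgba[i];
    const g = rgba[i + 1];
    const b = rgba[i + 2];
    gray[j] = Math.round(0.299 * r + 0.587 * g + 0.114 * b);
  }
  return gray;
}

// Usage: three hypothetical pixels (white, black, pure red).
const rgbaSample = Uint8Array.from([
  255, 255, 255, 255,
  0, 0, 0, 255,
  255, 0, 0, 255,
]);
console.log(Array.from(rgbaToGray(rgbaSample)));
```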
