
I am a beginner in TensorFlow and I want to build an OCR model with TensorFlow that recognizes Arabic words printed in cursive Arabic fonts (i.e. joined, handwriting-style Arabic script). Ideally, the model would be able to recognize both Arabic and English. Please see the attached image of a dictionary page that I am currently trying to OCR. The other pages in the book have the same font and layout, with both English and Arabic.

I have two questions:

(1) Should I train on individual characters taken from the joined/cursive Arabic text, or would I need bounding boxes for entire words or for individual characters?

(2) Are there any other OCR models in TensorFlow (or Keras) that handle cursive writing, particularly Arabic?

A scanned page of an Arabic dictionary that I wish to apply OCR to

piccolo

2 Answers


Tesseract, an OCR engine from Google, has a trained model for Arabic.

Learn more about it here: https://github.com/tesseract-ocr/tesseract

Languages it supports are here: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages

The Arabic trained data file is here: https://github.com/tesseract-ocr/tessdata/blob/master/ara.traineddata
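
As a rough sketch of how that trained data could be used for a mixed Arabic/English page, the snippet below builds a call to the `tesseract` command-line tool with both languages enabled. It assumes Tesseract 4 or later is installed and on the PATH with `ara.traineddata` and `eng.traineddata` in its tessdata directory. The function names, the file names, and the choice of page segmentation mode 3 (fully automatic) are illustrative assumptions, not something prescribed by the Tesseract project.

```python
import subprocess
from typing import List, Sequence


def build_tesseract_command(
    image_path: str,
    output_base: str,
    languages: Sequence[str] = ("ara", "eng"),
    page_seg_mode: int = 3,
) -> List[str]:
    """Build the argument list for the tesseract CLI.

    Languages are joined with '+', which asks Tesseract to use
    several trained models on the same image.
    """
    return [
        "tesseract", image_path, output_base,
        "-l", "+".join(languages),
        "--psm", str(page_seg_mode),
    ]


def run_ocr(image_path: str, output_base: str) -> str:
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

With a scan saved as, say, `page.png` (a hypothetical file name), `run_ocr("page.png", "page")` would write `page.txt` and return its contents. Accuracy on a two-language dictionary layout is not guaranteed, so it is worth trying other `--psm` values.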

Hope this helps!

Josh Payne

I don't think you can use the whole page as the input image. Working word by word is probably a better choice as a first, basic solution. These links may help:

https://hackernoon.com/latest-deep-learning-ocr-with-keras-and-supervisely-in-15-minutes-34aecd630ed8

http://ai.stanford.edu/~ang/papers/ICPR12-TextRecognitionConvNeuralNets.pdf

How to create dataset in the same format as the FSNS dataset?
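
To make the word-by-word idea concrete, here is a minimal sketch of one classic way to cut a binarized page into word boxes: projection profiles. It is not taken from the links above. It assumes the page has already been converted to a grid of 0 (background) and 1 (ink), that text lines are roughly horizontal, and that `min_word_gap` (a hypothetical tuning parameter) is larger than the small gaps that can occur between unconnected letters inside one Arabic word.

```python
from typing import List, Tuple

Image = List[List[int]]  # 1 = ink pixel, 0 = background


def find_runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where profile > 0.

    Runs separated by fewer than min_gap empty positions are merged.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))

    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)
    return merged


def segment_words(image: Image, min_word_gap: int = 3) -> List[Tuple[int, int, int, int]]:
    """Return (top, bottom, left, right) boxes, one per detected word."""
    boxes = []
    row_profile = [sum(row) for row in image]  # ink per row
    for top, bottom in find_runs(row_profile):
        line = image[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]  # ink per column
        for left, right in find_runs(col_profile, min_word_gap):
            boxes.append((top, bottom, left, right))
    return boxes
```

Each box can then be cropped and passed to a word-level recognizer. Real scans usually need deskewing and noise removal first, and dots or diacritics that sit far from the letters may need extra handling.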

Ali Abbasi