
I have a simple digit recognition project and have noticed that people generally take one of two approaches to this in Python. My goal is to take a PDF document as input and read the handwritten digits at particular places on the page.

I have seen people use either OpenCV, as in this question, or scikit-learn, as in this example. I am not familiar with either and am wondering which would be simplest to learn and implement for my intended use. Thanks.

splinter
  • What do you mean by "get the digits"? Generally, you could use any PDF reading tool (pdfminer, etc.), open the file, and use regular expressions to find your digits, if that's what you're referring to. Since you mentioned scikit, I assume that's not what you intended. – nir0s Mar 09 '17 at 18:32
  • The scikit-learn example is not solving the same problem! (Classifying a preprocessed and cropped digit != finding a digit). – sascha Mar 09 '17 at 18:32
  • I always recommend scikit-learn; it is much more robust and has many features to help you deal with a large dataset. To get the digits, crop them based on their pixel position and feed them to your machine learning algorithm. What are you planning on using? – JahKnows Mar 09 '17 at 18:33
  • sklearn has no object detector, so it's not ready for OCR. OP should define his task. What are "particular places"? – sascha Mar 09 '17 at 18:34

1 Answer


I suggest using both OpenCV and scikit-learn. After you convert your PDF into an image, you can use OpenCV for image pre-processing (Gaussian blur, thresholding, erosion/dilation filters) so that the digits become easier to extract. Then you can use contour tracing (again OpenCV) to detect the individual digits. Once you have extracted the digits (and provided you have a training set), you can use scikit-learn for the classification.

GStav
  • Thanks, that's useful. I do not have a training set. Is there some place where I can find a generic training set of digits? – splinter Mar 10 '17 at 10:14
  • As far as I know, the most famous training set of handwritten digits is [MNIST](http://yann.lecun.com/exdb/mnist/). – GStav Mar 10 '17 at 14:34
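
For a quick experiment before downloading MNIST, scikit-learn also bundles a much smaller handwritten digits set (8x8 images, loaded with `load_digits`). The sketch below trains a classifier on it; note that this is not MNIST itself, and the choice of `SVC` with `gamma=0.001` and the 75/25 split are assumptions made for illustration, not a tuned setup.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small digits dataset shipped with scikit-learn: 8x8 grayscale images,
# flattened to 64 features per sample, with pixel values from 0 to 16.
digits = load_digits()

# Hold out a quarter of the samples to estimate accuracy on unseen digits.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Support vector classifier; gamma is an illustrative value, not a tuned one.
clf = SVC(gamma=0.001)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

To classify your own crops with a model trained this way, each crop would need to be resized to 8x8 and rescaled to the same 0 to 16 range with light digits on a dark background; otherwise the features will not match the training data.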