1

I am trying to read a date from the OCR response for an image. The OCR output looks something like this:

\nPatientsName:KantibhaiPatelAgeISex:71YearslMale\nRef.by:Dr.KetanShuklaMS.MCH.\nReg.Date:29/06/201519;03\nLabRefNo;ARY-8922-15ReportingDate.29/06/201519:10\nHEMOGRAMREPORT\nTESTRESULTREFERENCEINTERVAL\n

I am interested in extracting the reporting date, i.e. 29/06/2015. I also want to store the patient details in a database (MongoDB) chronologically, so I need to store the date in a standardized format to make future queries easy. All suggestions are welcome.

Edit - Since the data comes from an OCR response, it tends to contain a lot of noise and sometimes misinterpreted characters. Is there any method with better fault tolerance for string searching?

re.search(r'Date:([0-9]{2}\/[0-9]{2}\/[0-9]{4})', ocr_response).group(1)

The above statement explicitly looks for digits, but what if a digit is not read or is misinterpreted as a letter?
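To make the concern concrete, here is a minimal sketch of one way to tolerate misread digits: let the pattern accept common look-alike letters where a digit is expected, then map them back to digits. The confusion table (`OCR_DIGIT_FIXES`), the `LOOKALIKE` class and the function name `find_reporting_date` are assumptions made for illustration, not something taken from the OCR engine, and the table would need tuning against real output.

```python
import re

# Hypothetical table of letter-for-digit OCR confusions; extend it from real data.
OCR_DIGIT_FIXES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# A "digit" is a real digit or one of the look-alike letters above.
LOOKALIKE = r"[0-9OolISB]"

# Anchor on the label, allow ":", ";" or "." after it, and "/" or "-" as separator.
PATTERN = re.compile(
    r"ReportingDate[:;.]?"
    r"(" + LOOKALIKE + r"{2})[/-]"
    r"(" + LOOKALIKE + r"{2})[/-]"
    r"(" + LOOKALIKE + r"{4})"
)


def find_reporting_date(text):
    """Return the reporting date as 'dd/mm/yyyy', or None if no candidate is found."""
    match = PATTERN.search(text)
    if match is None:
        return None
    # Map look-alike letters back to digits in each captured part.
    day, month, year = (part.translate(OCR_DIGIT_FIXES) for part in match.groups())
    return f"{day}/{month}/{year}"


print(find_reporting_date("LabRefNo;ARY-8922-15ReportingDate.29/06/201519:10"))
print(find_reporting_date("ReportingDate;29/O6/2Ol5"))
```

This only covers substituted characters. A digit that is dropped entirely, or a misread label, would still need a looser pattern or a fuzzy-matching approach.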

Harvey
  • Possible duplicate of [Converting string into datetime](http://stackoverflow.com/questions/466345/converting-string-into-datetime) – Blakes Seven Jan 28 '16 at 04:20
  • Your question is too broad, but you can have a look at Python regex for starters. – Selcuk Jan 28 '16 at 04:24

2 Answers

1

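Use the re module:

import re

# ocr_response is the OCR string from the question.
# (?:Date[:.])? matches the literal label optionally; [Date:]* would be a character class.
print(re.search(r'(?:Date[:.])?([0-9]{0,2}[\/-]([0-9]{0,2}|[a-z]{3})[\/-][0-9]{0,4})', ocr_response).group(1))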

Output:

29/06/2015
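
For the storage part of the question, a minimal sketch of turning the matched string into a `datetime` follows. The format list and the helper name `parse_report_date` are assumptions for illustration. The formats cover the variants mentioned in this thread (slash or hyphen separator, numeric or abbreviated month). Note that `%b` depends on the current locale, so English month abbreviations are assumed.

```python
from datetime import datetime

# Assumed variants: 29/06/2015, 29-06-2015, 29/jan/2015, 29-jan-2015.
CANDIDATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d/%b/%Y", "%d-%b-%Y")


def parse_report_date(raw):
    """Return a datetime for the first format that fits, or None if none does."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue  # try the next candidate format
    return None


reporting_date = parse_report_date("29/06/2015")
print(reporting_date.isoformat())  # 2015-06-29T00:00:00
```

PyMongo stores `datetime.datetime` values as BSON dates, so saving the parsed value rather than the raw string should allow chronological sorting and range queries.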
midori
  • The above RE would explicitly look for character with fixed length, but what if some characters are missing or misinterpreted? Can we add some sort of fault tolerance. – Harvey Jan 28 '16 at 05:14
  • I improved the regex, but I need to know exactly which variations can occur in order to write a more general regex for your response. – midori Jan 28 '16 at 05:17
  • In some places the middle portion (mm) is written as the abbreviated month name, e.g. `29/jan/2015`, or a hyphen (-) is used as the separator instead of a slash (/). – Harvey Jan 28 '16 at 06:27
0

You should use a good NER (Named Entity Recognition) model. You can train a custom model if you have a good amount of annotated training data, or you can use a pre-trained model, which does not require an annotated dataset.

spaCy is a good Python library for NER. Have a look at the link below: https://spacy.io/

It uses deep neural networks at the backend to recognize various entities present in the text (date in your case).

Hope it gives you an alternative to regular expressions. Thanks for the upvote in advance.