Parsing dates from OCRed files using dateparser library

Question

I want to extract dates from OCR images using the dateparser lib.

import glob
import re

import dateparser
import pandas as pd
import pytesseract
from PIL import Image

data = []
listOfPages = glob.glob(r"C:/Users/name/folder/test/*.tif")
for entry in listOfPages:
    text1 = pytesseract.image_to_string(
            Image.open(entry), lang="deu"
        )
    text = re.sub(r'\n',' ', text1)     
    date1 = re.compile(r'(Dresden(\.|,|\s+)?)(.*)', flags = re.DOTALL | re.MULTILINE)
    date = date1.search(text)
    if date:
        dates = dateparser.parse(date.group(3), date_formats=['%d %m %Y'], languages=['de'], settings={'STRICT_PARSING': True})
        
    else:
        dates = None
        if dates == None:
            dates = dateparser.parse(date.group(3), date_formats=['%d %B %Y'], locale = 'de', settings={'STRICT_PARSING': True})
        else:
            dates = None

    data.append([text, dates])
    
df0 = pd.DataFrame(data, columns =['raw_text', 'dates'])
print(df0)

Why am I getting the error `NameError: name 'dates' is not defined`?

Update: the error is now `TypeError: Input type must be str`.

Show the full traceback of the error as properly formatted text in the question. — Michael Butscher, Oct 24 '21 at 13:12
You are suppressing errors with `except: pass`. Some line before the assignment to `dates` is throwing an exception so the assignment never happens, so the name never gets defined. This is why you should not suppress errors. Change `pass` to `raise` or a `print()` call or both to find the error. — BoarGules, Oct 24 '21 at 13:16
Thanks, BoarGules, for the reminder. Now the error is `TypeError: Input type must be str`, which points at the `date` variable. I thought the result of `date1.search(text)` would be a string, or is it that regex span object thing? — id345678, Oct 24 '21 at 13:32
First of all, there is no match for `(City(\.|,|\s+)?)(.*)`, see [this regex demo](https://regex101.com/r/a6x4Ca/1). `date` is a match data object, not a string, but you are passing it to `dateparser.parse` as the first argument. You need to use `dateparser.parse(date.group(3), ...)`, since you want the value captured by `(.*)`. Also, `dateparser.parse(text, date_formats=['%d %m %Y'], languages=['de'], settings={'STRICT_PARSING': True})` does not find anything. What is the actual text? Also, note that you do not need `try`/`except` here at all. — Wiktor Stribiżew, Nov 07 '21 at 15:26
Thanks, Wiktor! If I don't use `try:except`, I get the error `AttributeError: 'NoneType' object has no attribute 'group'`. If I keep it, I no longer get an error, but all `dates` values come back as `None`. `.group(3)` is the right one though, according to https://regex101.com/r/ah5b03/1 — id345678, Nov 07 '21 at 16:46
I updated the picture with one from my subset. Do you know why that one isn't working? https://regex101.com/r/HF6yRn/1 Otherwise your solution captures all the dates correctly! Very well done. — id345678, Nov 07 '21 at 17:56

score 1 · Accepted Answer · answered Nov 07 '21 at 17:42

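The problem is that your `date` is a match data object, not a string, while `dateparser.parse` expects a string; pass it the captured group instead. Also, I am not sure `dateparser.parse` does what you need, since it is meant to parse a string that is itself a date rather than to find dates inside longer text. I'd recommend the datefinder package to extract dates from text.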
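
The difference between the match object and the captured string can be checked without any OCR. Below is a minimal sketch using only the standard library; the sample strings are made up for illustration and are not taken from the asker's scans. It also shows the `None` guard that avoids the `AttributeError` mentioned in the comments.

```python
import re

# Hypothetical sample text standing in for the OCR output.
sample = "Sächsischer Landtag Dresden, 17.10.1995 Drucksache"

match = re.search(r'\bDresden(?:[.,]|\s+)?(.*)', sample, re.DOTALL)

# re.search returns a Match object or None, never a str.
is_string = isinstance(match, str)

# Only .group(n) yields a string; guard against None before calling it.
captured = match.group(1) if match else None

# A page without "Dresden" produces no match at all.
no_match = re.search(r'\bDresden(?:[.,]|\s+)?(.*)', "Leipzig, 17.10.1995", re.DOTALL)
captured_missing = no_match.group(1) if no_match else None

print(type(match).__name__, repr(captured), captured_missing)
```

Note that the captured text may start with whitespace, so stripping it before handing it to a strict parser is probably a good idea.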

This is the regex I'd use:

\bDresden(?:[.,]|\s+)?(.*)

See the regex demo. It matches `Dresden` at the start of a word (`\b` is a word boundary); `(?:[.,]|\s+)?` is an optional non-capturing group that matches a comma, a period, or one or more whitespace characters; and `(.*)` captures any zero or more characters into Group 1 (with `re.DOTALL`, `.` also matches line break characters).
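
To see how each part of the pattern behaves, here is a small sketch that runs it against a few made-up strings (assumptions for illustration, not real OCR output). The `"\n"` mimics a line break in the raw Tesseract text.

```python
import re

pattern = re.compile(r'\bDresden(?:[.,]|\s+)?(.*)', re.DOTALL)

# Hypothetical snippets covering the three separator variants
# and one case where the word boundary prevents a match.
samples = [
    "Dresden, 17. Oktober 1995",
    "Dresden.\n17.10.1995",
    "Dresden   18 10 1995",
    "NeuDresden, 1 1 1990",  # no word boundary before "Dresden"
]

results = []
for s in samples:
    m = pattern.search(s)
    results.append(m.group(1) if m else None)

# Without re.DOTALL, "." stops at the line break, so Group 1 is empty.
without_dotall = re.search(
    r'\bDresden(?:[.,]|\s+)?(.*)', "Dresden.\n17.10.1995"
).group(1)

for s, r in zip(samples, results):
    print(repr(s), "->", repr(r))
```

The second sample is the reason for `re.DOTALL`: OCR output often puts the place name and the date on separate lines.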

Here is the Python snippet that seems to yield expected matches:

import glob
import re

import datefinder
import dateparser  # only needed for the commented-out alternative below
import pandas as pd
import pytesseract
from PIL import Image

imgpath = r'1.tif'
data = []
listOfPages = glob.glob(r"C:/Users/name/folder/test/*.tif")
listOfPages = [imgpath]  # overrides the glob result to test with a single sample image
for entry in listOfPages:
    # OCR the page with the German language model
    text = pytesseract.image_to_string(
            Image.open(entry), lang="deu"
        )

    dates = []  # stays empty if "Dresden" is not found on the page
    # Capture everything after "Dresden"; re.DOTALL lets . match line breaks
    date = re.search(r'\bDresden(?:[.,]|\s+)?(.*)', text, re.DOTALL)
    if date:
        # date.group(1) is a string; datefinder yields datetime objects
        dates = [t.strftime("%d %B %Y") for t in datefinder.find_dates(date.group(1))]
        #dates = dateparser.parse(date.group(1), date_formats=['%d %m %Y'], languages=['de'], settings={'STRICT_PARSING': True})

    data.append([text, dates])

df0 = pd.DataFrame(data, columns=['raw_text', 'dates'])
print(df0)

With your sample image, I get

                                            raw_text                               dates
0  Sächsischer Landtag DRUCKSACHE , 1972\n2. Wahl...  [17 October 1995, 18 October 1995]
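
If explicit formats are preferred over datefinder, the nested `if`/`else` from the question could be replaced with a loop over candidate formats. The following is only a sketch under stated assumptions: it uses the standard library's `datetime.strptime` instead of dateparser, so it covers numeric formats only, and the helper name `parse_first` and the format list are hypothetical. German month names (`%B`) would need dateparser or a German locale instead.

```python
from datetime import datetime


def parse_first(text, formats=("%d %m %Y", "%d.%m.%Y")):
    """Return the first successful parse of `text`, or None.

    `parse_first` and the default formats are illustrative assumptions.
    """
    if not isinstance(text, str):
        return None  # e.g. a Match object or None was passed by mistake
    candidate = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(candidate, fmt)
        except ValueError:
            continue  # this format did not fit; try the next one
    return None


print(parse_first(" 17 10 1995"))
print(parse_first("18.10.1995"))
print(parse_first("kein Datum"))
```

Because `strptime` requires the whole string to match, the input here has to be the date alone, not the full text captured after "Dresden".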
