4

I want to use dateparser to detect which cell contains a date. I have a broad range of different date formats: Fr, 21.02.2020 // 20.02.2020 // 21.02 // 21-02-2020 // January, 21 2020 // 21-Jan-2020 // 21/02/20 and I am sure there will still come a couple more in the future. The library dateparser is able to detect all of them pretty well, though it also detects 'PO', 'to','06','16:00' as date or relative date, which I don't want. I tried to check the Documentation and turn the relative date off or to look how to change to only detect "real dates". In the settings they offer different PARSERS and the possibility to only use some of them. These are the default PARSERS and the program runs through all of them:

'timestamp': If the input string starts with 10 digits, optionally followed by additional digits or a period (.), those first 10 digits are interpreted as Unix time.

'relative-time': Parses dates and times expressed in relation to the current date and time (e.g. “1 day ago”, “in 2 weeks”).

'custom-formats': Parses dates that match one of the date formats in the list of the date_formats parameter of dateparser.parse() or DateDataParser.get_date_data.

'absolute-time': Parses dates and times expressed in absolute form (e.g. “May 4th”, “1991-05-17”). It takes into account settings such as DATE_ORDER or PREFER_LOCALE_DATE_ORDER.

'base-formats': Parses dates that match one of the following date formats

I tried to only use one of them with the part settings={'base-formats':True}) in my code, nonetheless it won't work. Furthermore they offer the following snippet to turn of individual PARSERS:

>>> from dateparser.settings import default_parsers
>>> parsers = [parser for parser in default_parsers if parser != 'relative-time']
>>> parse('today', settings={'PARSERS': parsers})

Here pops up the error:

ModuleNotFoundError: No module named 'dateparser.settings'

I tried pip install, won't work.

Link to docu: https://dateparser.readthedocs.io/en/latest/#settings

And here's my code:

import dateparser

inputlist = [[' ','Supplier:',' Company Y', ' ', 'Project:','Carasco', ' '],[' ','21-Jan-2020',' ','Consultant:','James Farewell', ' ', ' '],['PO', ' Service', ' Cost Center', ' Accounting Object', ' deliver at', ' Amount', ' Unit'],['0106776','XYZ', 'Countable',' ', '16:00','6,00','h',],['Fr, 21.02.2020', '20.03.2020', ' ', ' ', ' ', ' ','6/04/20']]
print(inputlist)

outerlist=[]
for row in inputlist:
    innerlist = []
    for cell in row:
        parsecheck = dateparser.parse(cell,  languages=['en', 'de'], settings={'base-formats':True})
        if parsecheck == None:
            innerlist.append(0)
        else:
            innerlist.append(1)

    outerlist.append(innerlist)

print(outerlist)

I currently get:

[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 1, 1, 1], [1, 1, 0, 0, 0, 0, 1]]

Desired Output:

[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0,0, 0, 0], [1, 1, 0, 0, 0, 0, 1]]
hux0
  • 207
  • 1
  • 4
  • 17

2 Answers2

2

This is the best I could do:

import dateparser
import locale

inputlist = [[' ','Supplier:',' Company Y', ' ', 'Project:','Carasco', ' '],[' ','21-Jan-2020',' ','Consultant:','James Farewell', ' ', ' '],['PO', ' Service', ' Cost Center', ' Accounting Object', ' deliver at', ' Amount', ' Unit'],['0106776','XYZ', 'Countable',' ', '16:00','6,00','h',],['Fr, 21.02.2020', '20.03.2020', ' ', ' ', ' ', ' ','6/04/20']]
print(inputlist)

customlist = ["%d.%m.%Y", "%d-%b-%Y", "%w/%m/%y", "%a, %d.%m.%Y"]

outerlist=[]
saved = locale.setlocale(locale.LC_ALL)
locale.setlocale(locale.LC_ALL, 'de_de')
for row in inputlist:
    innerlist = []
    for cell in row:
        parsecheck = dateparser.parse(cell,  languages=['en', 'de'], settings={'PARSERS':['custom-formats']}, date_formats=customlist)
        if parsecheck == None:
            innerlist.append(0)
        else:
            innerlist.append(1)
    outerlist.append(innerlist)

locale.setlocale(locale.LC_ALL, saved)
print(outerlist)

The output is:

[[0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1]]

For parsing Fr, 21.02.2020 I changed the locale to Germany and, near the end I go back to your initial locale.

The format was based on documentation of strftime() and strptime() Behavior

SergioR
  • 1,386
  • 1
  • 7
  • 14
  • ok, cool. is there a way to search for any of the customlist items within the string, so that it neglects the ```Fr,``` of ```Fr, 21.02.2020```and tries to find the ```21.02.2020```within the string? I guess that would be the solution. – hux0 Apr 29 '20 at 10:52
  • I guess the option would be to use regular expression instead of dateparser. – SergioR Apr 29 '20 at 11:00
  • What does the %a stand for? Is it the day? Since the default PARSERS recognize FR as Friday, I guess there must be a filling (like %d). I haven't find it yet though. – hux0 Apr 29 '20 at 11:10
  • The documentation says: %a = Weekday as locale’s abbreviated name. Sun, Mon, …, Sat (en_US); So, Mo, …, Sa (de_DE) – SergioR Apr 29 '20 at 11:12
  • So for the German version Fr should normally work, I do believe. – hux0 Apr 29 '20 at 11:13
  • Feel free to do the modifications you think it can work. All the formats accepted are listed in the documentation I linked in the answer. – SergioR Apr 29 '20 at 11:16
  • Maybe trying with only one language or including the locales would work. – SergioR Apr 29 '20 at 11:22
  • I tried it, does not work unfortunately. Where should I include the locales? – hux0 Apr 29 '20 at 13:57
  • It is supposed to go in dateparser.parse() but I couldn't make work. – SergioR Apr 29 '20 at 20:33
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212843/discussion-between-hux0-and-sergior). – hux0 Apr 30 '20 at 09:32
1

Agreed that changing the settings does not work as expected based on the docs. Looking at the code, it doesn't look like you can get date-only objects (though I'm not an expert and may have missed something). If I understand correctly, it should be settings = {'PARSER': 'base-formats'} instead of settings = {'base-formats':True}, but that doesn't solve your problem.

I can only suggest a work around making use of the fact that the hour and minute of the returned datetime object default to 0.

import dateparser

outerlist=[]
for row in inputlist:
    innerlist = []
    for cell in row:
        parsecheck = None

        if dateparser.parse(cell, settings={'STRICT_PARSING':True}) != None and dateparser.parse(cell).hour == 0:
            parsecheck = dateparser.parse(cell,  languages=['en', 'de'], settings={'PARSER':'date_formats'})
        if parsecheck == None:
            innerlist.append(0)
        else:
            innerlist.append(1)

    outerlist.append(innerlist)

STRICT_PARSING:True means the returned value is None if any ofYEAR, DAY or MONTH are missing, which takes care of 'PO', 'h' and '6,00' returning valid datetime objects. Checking if the hour attribute is zero gets rid of the valid times.

Unfortunately

for cell in row:
parsecheck = dateparser.parse(cell, languages=['en','de'], settings={'STRICT_PARSING':True, 'PARSER':'date_formats'})

    if parsecheck != None and parsecheck.hour == 0:
        innerlist.append(1)
    else:
        innerlist.append(0)

doesn't seem to work since it interprets '16:00' as a date

edit - you don't need to import datetime

kah
  • 23
  • 5
  • Yes, I tried ```settings = {'PARSER': 'base-formats'}``` as well before and it did not work either (as you said). To me the docs are not clearly specified. Does ```STRICT_PARSING:True``` followingly not work for 21.02., because the year is missing? That was one of the problems why I did not consider it. Thanks for your help!! – hux0 Apr 29 '20 at 09:34
  • 1
    Documentation says PARSERS is a list of name of parsers. You should try `settings = {'PARSERS': ['base-formats']}`, but those base-formats doesn't match the one on your list as far as I can see. – SergioR Apr 29 '20 at 10:13
  • You're correct, 21.02 won't be identified as a date with the above approach. The parse method uses a tokenizer that does not distinguish between punctuation characters (excluding ':', which it handles as a digit). So you can give 21.02 as a valid date, but 6,00, 6!00 and 6?00 are equally valid dates. – kah Apr 29 '20 at 11:42