1

In trying to process a large number of bank account statements given in CSV format I realized that some of the dates are incorrect (30th of February, which is not possible).

So this snippet fails [1] telling me that some dates are incorrect:

        df_from_csv = pd.read_csv(   csv_filename
                                   , encoding='cp1252'
                                   , sep=";"
                                   , thousands='.', decimal=","
                                   , dayfirst=True
                                   , parse_dates=['Buchungstag', 'Wertstellung']  
                                 )

I could of course pre-process those CSV files and replace the 30th of Feb with 28th of Feb (or whatever the Feb ended in that year).

But is there a way to do this in Pandas, while importing? Like "If this column fails, set it to X"?

Sample row

775945;28.02.2018;30.02.2018;;901;"Zinsen"

As you can see, the date 30.02.2018 is not correct, because there ain't no 30th of Feb. But this seems to be a known problem in Germany. See [2].


[1] Here's the error message:

ValueError: day is out of range for month

[2] https://de.wikipedia.org/wiki/30._Februar

Ugur
  • 1,914
  • 2
  • 25
  • 46
  • 1
    Is possible see some data sample? – jezrael Mar 17 '21 at 11:40
  • I don't get a `ValueError` when reading the sample row with your `read_csv` command. It just gives me the `30.02.2018` column as `object` instead of `datetime`. – tdy Mar 17 '21 at 12:02
  • Ok, I should have added that my intention was to convert it to dates. – Ugur Mar 17 '21 at 12:03
  • 1
    Ok, I would suggest reading the csv without automatic date parsing so the date columns get loaded as strings, then manually `.str.replace('30.02.2018', np.nan)` and `pd.to_datetime` – tdy Mar 17 '21 at 12:06

2 Answers2

1

Here is how I solved it:

I added a custom date-parser:

import calendar

def mydateparser(dat_str):
    """Given a date like `30.02.2020` create a correct date `28.02.2020`"""
    if dat_str.startswith("30.02"):
        (d, m, y) = [int(el) for el in dat_str.split(".")]
        # This here will get the first and last days in a given year/month:
        (first, last) = calendar.monthrange(y, m)
        # Use the correct last day (`last`) in creating a new datestring:
        dat_str = f"{last:02d}.{m:02d}.{y}"
    return pd.datetime.strptime(dat_str, "%d.%m.%Y")

# and used it in `read_csv`
for csv_filename in glob.glob(f"{path}/*.csv"):
    # read csv into DataFrame
    df_from_csv = pd.read_csv(csv_filename,
                              encoding='cp1252',
                              sep=";",
                              thousands='.', decimal=",",
                              dayfirst=True,
                              parse_dates=['Buchungstag', 'Wertstellung'],
                              date_parser=mydateparser
                             )

This allows me to fix those incorrect "30.02.XX" dates and allow pandas to convert those two date columns (['Buchungstag', 'Wertstellung']) into dates, instead of objects.

Ugur
  • 1,914
  • 2
  • 25
  • 46
0

You could load it all up as text, then run it through a regex to identify non legal dates - which you could apply some adjustment function.

A sample regex you might apply could be:

ok_date_pattern = re.compile(r"^(0[1-9]|[12][0-9]|3[01])[-](0[1-9]|1[012])[-](19|20|99)[0-9]{2}\b")

This finds dates in DD-MM-YYYY format where the DD is constrained to being from 01 to 31 (i.e. a day of 42 would be considered illegal) and MM is constrained to 01 to 12, and YYYY is constrained to being within the range 1900 to 2099.

There are other regexes that go into more depth - such as some of the inventive answers found here

What you then need is a working adjustment function - perhaps that parses the date as best it can and returns a nearest legal date. I'm not aware of anything that does that out of the box, but a function could be written to deal with the most common edge cases I guess.

Then it'd be a case of tagging legal and illegal dates using an appropriate regex, and assigning some date-conversion function to deal with these two classes of dates appropriately.

Thomas Kimber
  • 10,601
  • 3
  • 25
  • 42