0

I am reading ~50 files and adding them to the same table consecutively, there is one for each month over the past few years. After the first year, the date format presented in the CSV files shifted from the format YYYY-mm-dd to mm/dd/YYYY. SQL Server is fine with the date format YYYY-mm-dd and is what it expects, but once the format switched in the CSV my program will crash I wrote a piece of code to try and convert the data to the correct format, but it didn't work, as shown here:

if '/' in df['SubmissionDate'].iloc[0]:
                    df['SubmissionDate'] = pd.to_datetime(df['SubmissionDate'], format = '%m/%d/%Y')

I believe that this would have worked, barring the issue that some of the rows of data have no date, so I need to either find some other way to allow the SQL Insert statement to accept this different date format, or avoid trying to convert the blank items in the Submission Date column.

Any help would be greatly appreciated!

Jadon Latta
  • 147
  • 1
  • 11
  • @AaronBertrand The issue is, this data is not mine to change, I have to deal with what I am given, so I have to deal with the mm/dd/YYYY format somehow, as I am building the database. Hence my desire to find some way to convert the date format – Jadon Latta Aug 12 '22 at 17:57
  • Then use `SET DATEFORMAT` at the start of your batch; but @AaronBertrand is right, the solution is to use an unambiguous format. But certainly don't be inconsistent. If you start with `yyyy-MM-dd` then use that the whole time; changing the format is just asking for problems. If you aren't responsible for that mistake, get whomever is to be responsible for it. – Thom A Aug 12 '22 at 17:59

1 Answers1

0

It sounds like you are not using parse_dates= when loading the CSV file into the DataFrame. The date parser seems to be able to handle multiple date formats, even within the same file:

import io
import pandas as pd

csv = io.StringIO(
    """\
id,a_date
1,2001-01-01
2,1/2/2001
3,12/31/2001
4,31/12/2001
"""
)
df = pd.read_csv(csv, parse_dates=["a_date"])
print(df)
"""
   id     a_date
0   1 2001-01-01
1   2 2001-01-02
2   3 2001-12-31
3   4 2001-12-31
"""
Gord Thompson
  • 116,920
  • 32
  • 215
  • 418
  • That seems... dangerous. What does it do with `04/06/2005`? How about `04/06/05`? – Aaron Bertrand Aug 14 '22 at 14:11
  • The default `date_parser` for [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) is [dateutil.parser.parse](https://dateutil.readthedocs.io/en/stable/parser.html#dateutil.parser.parse), and its default values are `dayfirst=False` and `yearfirst=False`. So, the default configuration will parse both `04/06/2005` and `04/06/05` as `datetime.date(2005, 4, 6)` – Gord Thompson Aug 14 '22 at 15:09
  • Still seems a little too ambiguous and forgiving if it can assume that 07/01/2005 is July 1st but then on the next line it sees 13/01/2005 and assumes you’ve changed language or geography and must mean January 13th, not that it might have misinterpreted January 7th in the previous line. – Aaron Bertrand Aug 14 '22 at 15:12
  • And I understand “this is how the software works” but that’s a really crappy “feature” and the real answer is not to take advantage of that but rather to stop feeding garbage regional formats anywhere in the first place. – Aaron Bertrand Aug 14 '22 at 15:13
  • FWIW, pandas does emit "UserWarning: Parsing '31/12/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing." – Gord Thompson Aug 14 '22 at 15:21
  • Sure, but warnings are easily dismissed / ignored, and now your data’s bad. – Aaron Bertrand Aug 14 '22 at 15:23
  • GIGO. "As it was in the beginning, is now and ever shall be…" – Gord Thompson Aug 14 '22 at 15:28
  • "warnings are easily dismissed / ignored" - We can always choose to [treat them like errors](https://stackoverflow.com/q/5644836/2144390). – Gord Thompson Aug 14 '22 at 15:40
  • We can. But _so many people_ do not, and generally they’re the ones coming here for help, not the ones providing it. – Aaron Bertrand Aug 14 '22 at 15:41