
I have a pandas DataFrame column that has string values in the format YYYY-MM-DD HH:MM:SS.mmmmmmm, for example 2021-12-26 21:10:18.6766667. I have verified that all values are in this format, with 7 fractional-second digits. But the following code throws the conversion error shown below when it tries to insert the data into an Azure Databricks SQL database:

Conversion failed when converting date and/or time from character string

Question: What could be a cause of the error and how can we fix it?

Remark: After conversion, the initial value (for example 2021-12-26 21:10:18.6766667) even gains two more digits at the end, becoming 2021-12-26 21:10:18.676666700 - with 9 fractional-second digits.

import sqlalchemy as sq
import pandas as pd
import datetime

# engine = sq.create_engine(...)  # assumed to be created earlier

data_df = pd.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')

data_df['OrderDate'] = data_df['OrderDate'].astype('datetime64[ns]')

data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
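The two extra digits mentioned in the remark come from pandas itself: `datetime64[ns]` stores nanosecond precision, so a 7-digit fractional-second string is padded to 9 digits. A minimal sketch reproducing this (using the example value from the question):

```python
import pandas as pd

# A 7-digit fractional-seconds string, as in the question.
s = pd.Series(['2021-12-26 21:10:18.6766667'])

# datetime64[ns] stores nanosecond precision, so the value is
# padded from 7 to 9 fractional digits on conversion.
converted = s.astype('datetime64[ns]')
print(converted[0])  # 2021-12-26 21:10:18.676666700
```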
  • Out of curiosity, do you get this same error if you convert `data_df['OrderDate']` to a `datetime.datetime`? – Simon Mar 06 '22 at 02:22
  • @Simon If you mean `pd.to_datetime(data_df['OrderDate'])` , then yes exact same error. And in both cases it adds even two more digits `00` at the end to 7 digits milliseconds. – nam Mar 06 '22 at 02:27
  • Sorry, I meant the Python `datetime.datetime`. Like shown [here](https://stackoverflow.com/questions/37354498/how-to-convert-a-pandas-data-frame-column-from-np-datetime64-to-datetime) – Simon Mar 06 '22 at 02:31
  • @Simon I'm new to the Python world. On `datetime.datetime(data_df['OrderDate'])`, I get the error: `TypeError: cannot convert the series to `. If this is not what you meant, please let me know, maybe with a one-liner. – nam Mar 06 '22 at 02:41
  • 1
    @tdy Thanks for clarifying. On `data_df['OrderDate'] = pd.to_datetime(data_df['OrderDate']).dt.to_pydatetime()`, I still get the exact same error. But the number of digits in milliseconds reduces to 6 from 7. – nam Mar 06 '22 at 02:50
  • Try: `data_df['OrderDate'] = pd.to_datetime(data_df['OrderDate']).dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]` – tdy Mar 06 '22 at 05:28
  • 1
    @tdy Your suggestion worked perfectly (thank you). For the benefit of other users, you may want to convert your comment into a Response. And, I'll mark it as an answer. – nam Mar 07 '22 at 00:27
  • @nam Great. After some more digging, I think it will actually work without converting `OrderDate` at all. In other words, load with `read_csv` and then directly save `to_sql` (without changing `OrderDate`). – tdy Mar 07 '22 at 01:26
  • 1
    @tdy In my case the raw data did not work. So, I used your second choice. But it's good that you mentioned another option as it may work for some other readers of this post. – nam Mar 07 '22 at 01:33

1 Answer


Keep the dates as plain strings, without converting them with to_datetime.

This is because Databricks SQL is based on SQLite, and SQLite expects date strings, as the SQLAlchemy SQLite dialect documentation notes:

In the case of SQLite, date and time types are stored as strings which are then converted back to datetime objects when rows are returned.


If the raw date strings still don't work, convert them with to_datetime and reformat into a safe millisecond-precision format using dt.strftime:

df['OrderDate'] = pd.to_datetime(df['OrderDate']).dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]

Or, if the column is already a datetime dtype, use dt.strftime directly:

df['OrderDate'] = df['OrderDate'].dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
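A minimal sketch of what the reformatting does, using the example value from the question: `%f` renders 6 fractional digits and `.str[:-3]` trims the last 3, leaving millisecond precision (the assumption being that the target DATETIME column rejects sub-millisecond digits):

```python
import pandas as pd

s = pd.Series(['2021-12-26 21:10:18.6766667'])

# strftime('%f') renders 6 fractional digits; .str[:-3] trims the
# last 3, leaving a 3-digit (millisecond) string.
out = pd.to_datetime(s).dt.strftime('%Y-%m-%d %H:%M:%S.%f').str[:-3]
print(out[0])  # 2021-12-26 21:10:18.676
```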