2

I am trying to change the format of the date in a pandas dataframe. If I check the date in the beginning, I have:

df['Date'][0]
Out[158]: '01/02/2008'

Then, I use:

df['Date'] = pd.to_datetime(df['Date']).dt.date  

To change the format to

df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)

However, this takes a veeeery long time, since my dataframe has millions of rows.

All I want to do is change the date format from MM-DD-YYYY to YYYY-MM-DD. How can I do it in a faster way?

python_enthusiast
  • 896
  • 2
  • 7
  • 26
  • [to_datetime()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) has a format option. – Jab Jan 17 '19 at 21:05
  • @Jaba Which would probably take just as long. – Rocky Li Jan 17 '19 at 21:06
  • Is the column stored as a string in your dataframe? If so, can you do the strptime conversion before loading the df? – Daniel Long Jan 17 '19 at 21:06
  • @Jaba yes, it takes long too. =/ – python_enthusiast Jan 17 '19 at 21:08
  • @DanielLong it is a string. I will check that – python_enthusiast Jan 17 '19 at 21:08
  • You'll have to drop down to numpy. jpp has an answer for this but I don't even know how to begin searching for it – roganjosh Jan 17 '19 at 21:08
  • @roganjosh https://stackoverflow.com/questions/16176996/keep-only-date-part-when-using-pandas-to-datetime is this what you mean? – python_enthusiast Jan 17 '19 at 21:11
  • @python_enthusiast nope and that won't help with speed – roganjosh Jan 17 '19 at 21:13
  • @DanielLong I might be wrong, but I believe strptime is only for individual strings. I want to convert a pandas series. – python_enthusiast Jan 17 '19 at 21:21
  • @python_enthusiast How are you creating the dataframe? Can you convert the string to a datetime object before initializing? – Daniel Long Jan 17 '19 at 21:28
  • the dataframe is coming from pd.read_csv(zf.open('filename') , where zf is zipfile.ZipFile('zipname'), so it is just from reading a big csv file. I can convert using pd.to_datetime(df['Date']). I don't know if there is a better way. – python_enthusiast Jan 17 '19 at 21:32
  • @python_enthusiast - Have you tried using something directly from pandas. For example pandas.to_datetime http://pandas.pydata.org/pandas-docs/version/0.23/whatsnew.html#to-datetime-has-gained-an-origin-parameter? Might optimize the performance. –  Jan 17 '19 at 21:59
  • 1
    @python_enthusiast, may be worth to check [here](https://stackoverflow.com/questions/32034689/why-is-pandas-to-datetime-slow-for-non-standard-time-format-such-as-2014-12-31) – Karn Kumar Jan 18 '19 at 06:12

3 Answers3

1

You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.

Then you parse the dates into the new format and merge the results back into the original DataFrame.

This requires some time because of the merging, but it takes advantage from the fact that many dates are repeated a large number of times. You want to convert each date only once!

You can use the following code:

date_parser = lambda x: pd.datetime.strptime(str(x), '%m/%d/%Y')

df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)

df = df.set_index([ 'date_index' ])
df['New Date'] = dates
df = df.reset_index()

df.head()

In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.

Abramodj
  • 5,709
  • 9
  • 49
  • 75
0

I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory, this should help. Pandas has a built in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.

Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.

import pandas as pd

df = pd.read_csv('test.csv', parse_dates=[0,2])

The above example would try to parse the 1st and 3rd (zero-based) columns as dates.

The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.

Daniel Long
  • 1,162
  • 14
  • 30
0

Following a lead at @pygo's comment, I found that my mistake was to try to read the data as

df['Date'] = pd.to_datetime(df['Date']).dt.date  

This would be, as this answer explains:

This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).

As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True

When using any of the date parsers from the answers above, we go into the for loop. Also, when specifying the format we want (instead of the format we have) in the pd.to_datetime, we also go into the for loop.

Hence, instead of doing

df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')

or

df['Date'] = pd.to_datetime(df['Date']).dt.date  

we should do

df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date

By supplying the current format of the data, it is read really fast into datetime format. Then, using .dt.date, it is fast to change it to the new format without the parser.

Thank you to everyone who helped!

Community
  • 1
  • 1
python_enthusiast
  • 896
  • 2
  • 7
  • 26