2

I am having a dataframe with a column named 'Formatted Date', by default the column's dtype is'object'.

I am trying to Convert it to datetime replacing all '-' separators with '/' and sorting the column by date, this is the raw format:

0   2006-04-01 00:00:00.000 +0200   
1   2006-04-01 01:00:00.000 +0200
2   2006-04-01 02:00:00.000 +0200
3   2006-04-01 03:00:00.000 +0200
4   2006-04-01 04:00:00.000 +0200

when I run below code everything seems good and all - separators will be replaced with /

df= pd.read_csv (r"df.csv")
df['Formatted Date'] = df['Formatted Date'].apply(lambda x:  x.replace( '-' , '/'))

out:

0   2006/04/01 00:00:00.000 +0200
1   2006/04/01 01:00:00.000 +0200
2   2006/04/01 02:00:00.000 +0200
3   2006/04/01 03:00:00.000 +0200
4   2006/04/01 04:00:00.000 +0200

BUT when I try to convert it to datetime using below code all separators will get back to -

df['Formatted Date'] = pd.to_datetime((df['Formatted Date']), format = '%Y/%m/%d %H:%M:%S', utc=True)

and the output will be:

0   2006-03-31 22:00:00+00:00
1   2006-03-31 23:00:00+00:00
2   2006-04-01 00:00:00+00:00
3   2006-04-01 01:00:00+00:00
4   2006-04-01 02:00:00+00:00

AND for sorting I am using df.sort_values(by=['Formatted Date']) which is not working( It's not sorting the values, does nothing...)

EDIT Thank you ljdyer for the Answer and Yes, I am using jupyter notebook.

df.style.format code is not changing the separators in my case and I don't know why

df.style.format({'Formatted Date': lambda t: t.strftime("'%Y/%m/%d %H:%M:%S")}) 
df

Out:

0   2006-04-01 00:00:00.000 +0200
1   2006-04-01 01:00:00.000 +0200
2   2006-04-01 02:00:00.000 +0200
3   2006-04-01 03:00:00.000 +0200
4   2006-04-01 04:00:00.000 +0200

Using:

df['Formatted Date'] = pd.to_datetime((df['Formatted Date']), format = '%Y/%m/%d %H:%M:%S', utc=True)

df['Formatted Date'] = df['Formatted Date'].dt.strftime('%Y/%m/%d %H:%M:%S')

Does change the separators but the dtype remains object,as you mentioned. is there any way to keep the column format as datetime64 and change the separators from - to / ?

and about sorting , this code: df.sort_values(by=['Formatted Date']) is not working because the first date in dataset is: 2006-01-01 00:00:00.000 +0100 not 2006-03-31 22:00:00+00:00

and one more thing I noticed is that when I open dataset in notepad++ the datetime is showing like:

2006-01-01 22:00:00.000 +0100
2006-01-01 23:00:00.000 +0100
2006-01-10 00:00:00.000 +0100
2006-01-10 01:00:00.000 +0100

Instead of:

2006-01-01 22:00:00.000 +0100
2006-01-01 23:00:00.000 +0100
2006-01-02 00:00:00.000 +0100
2006-01-02 01:00:00.000 +0100

by the way English isn't my native language I hope the explanations are clear

ShaMiiM
  • 21
  • 4

1 Answers1

1

To understand why the content of the column displays with - after you apply to_datetime, you need to think about what the datatype of the column is at each stage in your code and understand the following point made by MarianD in this answer:

There is a difference between the content of a dataframe cell (a binary value) and its presentation (displaying it) for us, humans.

Why does 'Formatted Date' print with - after applying to_datetime?

When you read your data from your csv file, the Formatted Date column will have dtype object, which is the datatype that pandas uses for storing strings (you can check the type by writing print(df.dtypes). Strings are sequences of printable characters, so you can replace - characters with / and / characters will be printed when you print the dataframe, as you did using the following line:

df['Formatted Date'] = df['Formatted Date'].apply(lambda x:  x.replace( '-' , '/'))

After running this line the column still has dtype object, so the / characters will be printed when you print the dataframe.

Next you ran this line:

df['Formatted Date'] = pd.to_datetime((df['Formatted Date']), format = '%Y/%m/%d %H:%M:%S', utc=True)

to_datetime converts the column dtype to datetime64. datetime64 is a special type for storing datetimes which has its own internal method to represent instances when printing, which uses - characters. When you print the dataframe, you get the representation of the datetime64 instances in the Formatted Date column. Note the the format parameter in to_datetime specifies the format to read from, and has nothing to do with the format in which the datetimes are represented.

Having the dtype as datetime64 allows you to manipulate the data as datetimes in a way you couldn't when the dtype was object. For example, calling df.sort_values(by=['Formatted Date']) should correctly sort the column by date and time.

Why does df.sort_values not work?

I still can't quite follow your examples because I get different times in UTC (I guess because I'm in a different timezone) and the datetimes you provided look to be in order whichever way I look at them, but I experimented by changing the order around and was able to get back into order using df = df.sort_values(by=['Formatted Date']).

Like most pandas methods on dataframes, you need to either assign the result to the dataframe or use inplace to actually change the content of the dataframe:

df = df.sort_values(by=['Formatted Date'])

OR

df.sort_values(by=['Formatted Date'], inplace=True)

How can I print the datatimes with / characters after converting to datetime64 dtype?

It seems that you want to dataframe to contain datetimes, but to be presented with / characters. To control the way the datetimes are printed, you basically need to convert back to a string type, but the method you use will vary slightly depending on your use case.

If you are working in an iPython notebook (Jupyter, Google Colab), the cleanest way to do this is using df.style.format

df.style.format("{:%m/%d/%Y %H:%M:%S}", subset=['Formatted Date'])

This has the advantage of changing the way the dataframe is displayed without affecting the content of the dataframe.

The df.style.format option is not available if running in the terminal (Spyder, VS Code, etc.), so the only option as far as I'm aware would be to convert the column elements to strings before printing, and then convert back if you need to:

df['Formatted Date'] = df['Formatted Date'].dt.strftime('%Y/%m/%d %H:%M:%S')
print(df)
df['Formatted Date'] = pd.to_datetime((df['Formatted Date']), format = '%Y/%m/%d %H:%M:%S', utc=True)

I believe there are other methods if outputting to Excel, for example, which are documented online.

ljdyer
  • 1,946
  • 1
  • 3
  • 11
  • Can you check my edited version please – ShaMiiM Sep 25 '22 at 11:01
  • On `df.style.format`, I reviewed the documentatoin at https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.format.html and it looks like you want to display the result directly. So remove the second line `df` and you'll get the output you want. I've confirmed this in Google Colab. Means that the actual dataframe is not changed so you have to include the `.style.format` each time you display. I also changed the syntax slightly to use f-strings which seems more concise and closer to the documentation (see the answer above) – ljdyer Sep 26 '22 at 04:24
  • I've also added a section to the above about `sort_values` – ljdyer Sep 26 '22 at 04:31
  • I'm not sure I can help with your Notepad++ issue because I can't reproduce it. When I save a list of datetimes to a csv file and then read with pandas I get whatever datetimes were in the csv file, and can't think of any reason why this would not be the case. – ljdyer Sep 26 '22 at 04:33