2

I'm facing a problem in converting dates in my dataframe.

Example:

        col1                 col2
1     2018-10-02            2018-07-03 
      09:00:00+00           23:56:09.134+00
2     2018-07-03            2018-10-02
      23:56:09.134+00       09:00:00+00           


df.dtypes
col1    object
col2    object

Due to the fact that I need to do some ML on the dataset, I need to transform the dates into a float, so in the julian format.

I tried a lot of things like: Calculating julian date in python or

df['col1'] = df['col1'].dt.strftime("%y%j")

The example above works well after having converted the column to_datetime, but it raises an error if I pass df[['col1', "col2"]] :

AttributeError: 'DataFrame' object has no attribute 'dt'

I've also problem with the format, since some obs does have milliseconds and some doesn't. I can drop them I think, but also in this case I don't know how.

Also I'm not able to find a julian format which is extended up to the seconds (%y%j is not enough, and I don't know which letters of the format I need)

I have many other columns with dates in my df, so there is a simple way to convert all of them?

Thanks

Federicofkt
  • 149
  • 6

1 Answers1

2

Are you looking for pandas.Timestamp.to_julian_date?

import pandas as pd

df = pd.DataFrame({'col1': ['2018-10-02 09:00:00+00','2018-07-03 23:56:09.134+00'],
                   'col2': ['2018-07-03 23:56:09.134+00','2018-10-02 09:00:00+00']})


df['col1'] = pd.to_datetime(df['col1'])
df['col2'] = pd.to_datetime(df['col2'])

df['col1'].apply(pd.Timestamp.to_julian_date)
# 0    2.458394e+06
# 1    2.458303e+06
# Name: col1, dtype: float64

The returned floating point number represents number of days, e.g.

df['col3'] = pd.to_datetime(['2018-07-02 12:46:32.257000+00:00', '2018-07-02 13:02:15.855000+00:00'])
t = df['col3'].apply(pd.Timestamp.to_julian_date).values
print(f"col3 delta in minutes: {(t[1]-t[0])*24*60}")
# col3 delta in minutes: 15.726633667945862
FObersteiner
  • 22,500
  • 8
  • 42
  • 72
  • Many thanks for the reply, but it raises an attribute error: AttributeError: 'Series' object has no attribute 'year' when I try on multiple columns – Federicofkt Jun 10 '20 at 11:48
  • 1
    @Federicofkt maybe [this](https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas) actually is the issue here? Not totally sure what you're aiming at... – FObersteiner Jun 10 '20 at 11:54
  • .applymap() works perfectly! Now the only thing is that to_julian_date doesn't take the hour, because same days with different hours now are the same float. There is a way to do that? Thanks – Federicofkt Jun 10 '20 at 12:13
  • @Federicofkt the return value of Timestamp.to_julian_date is julian days, so the difference in hours should be represented in the decimals of the floating point number – FObersteiner Jun 10 '20 at 12:20
  • 2018-07-02 12:46:32.257000+00:00 and 2018-07-02 13:02:15.855000+00:00 have the same julian, that is 2.458302e+06, but the time is different – Federicofkt Jun 10 '20 at 12:28
  • @Federicofkt are you sure? I've added a little example to show what I mean. – FObersteiner Jun 10 '20 at 12:36
  • You were right, sorry for wasting your time. Thank you very much! – Federicofkt Jun 10 '20 at 15:23
  • @Federicofkt no worries, glad I could help! – FObersteiner Jun 10 '20 at 15:47