27

I have a dataframe that provides two integer columns with the Year and Week of the year:

import pandas as pd
import numpy as np
L1 = [43,44,51,2,5,12]
L2 = [2016,2016,2016,2017,2017,2017]
df = pd.DataFrame({"Week":L1,"Year":L2})

df
Out[72]: 
   Week  Year
0    43  2016
1    44  2016
2    51  2016
3     2  2017
4     5  2017
5    12  2017

I need to create a datetime-object from these two numbers.

I tried this, but it throws an error:

df["DT"] = df.apply(lambda x: np.datetime64(x.Year,'Y') + np.timedelta64(x.Week,'W'),axis=1)

Then I tried this, it works but gives the wrong result, that is it ignores the week completely:

df["S"] = df.Week.astype(str)+'-'+df.Year.astype(str)
df["DT"] = df["S"].apply(lambda x: pd.to_datetime(x,format='%W-%Y'))

df
Out[74]: 
   Week  Year        S         DT
0    43  2016  43-2016 2016-01-01
1    44  2016  44-2016 2016-01-01
2    51  2016  51-2016 2016-01-01
3     2  2017   2-2017 2017-01-01
4     5  2017   5-2017 2017-01-01
5    12  2017  12-2017 2017-01-01

I'm really getting lost between Python's datetime, Numpy's datetime64, and pandas Timestamp, can you tell me how it's done correctly?

I'm using Python 3, if that is relevant in any way.

EDIT:

Starting with Python 3.8 the problem is easily solved with a newly introduced method on datetime.date objects: https://docs.python.org/3/library/datetime.html#datetime.date.fromisocalendar

Khris
  • 3,132
  • 3
  • 34
  • 54
  • Are the `Week` values [ISO week numbers](https://en.wikipedia.org/wiki/ISO_week_date) or do they represent units of 7 days? – unutbu Aug 01 '17 at 11:50
  • Initially I have timestamps in `s`, they are converted using `pd.to_datetime()`, the week is then extracted by using `dt.week` on the Timestamp. – Khris Aug 01 '17 at 11:53
  • 1
    There is a subtle pitfall here -- If `s` contains the date `2016-1-1`, then its ISO week number (returned by `dt.week`) is 53, and its ISO year (which you didn't record) is 2015. If you try to reconstitute the date using the year 2016 and ISO week 53, then you get 2017-01-02 (assuming Monday begins the week). So you can't roundtrip correctly unless you also record the ISO year (which is not always the same as the actual year.) – unutbu Aug 01 '17 at 12:11
  • @unutbu Thanks, this is good to know. – Khris Aug 01 '17 at 12:24

4 Answers4

19

Try this:

In [19]: pd.to_datetime(df.Year.astype(str), format='%Y') + \
             pd.to_timedelta(df.Week.mul(7).astype(str) + ' days')
Out[19]:
0   2016-10-28
1   2016-11-04
2   2016-12-23
3   2017-01-15
4   2017-02-05
5   2017-03-26
dtype: datetime64[ns]

Initially I have timestamps in s

It's much easier to parse it from UNIX epoch timestamp:

df['Date'] = pd.to_datetime(df['UNIX_Time'], unit='s')

Timing for 10M rows DF:

Setup:

In [26]: df = pd.DataFrame(pd.date_range('1970-01-01', freq='1T', periods=10**7), columns=['date'])

In [27]: df.shape
Out[27]: (10000000, 1)

In [28]: df['unix_ts'] = df['date'].astype(np.int64)//10**9

In [30]: df
Out[30]:
                       date    unix_ts
0       1970-01-01 00:00:00          0
1       1970-01-01 00:01:00         60
2       1970-01-01 00:02:00        120
3       1970-01-01 00:03:00        180
4       1970-01-01 00:04:00        240
5       1970-01-01 00:05:00        300
6       1970-01-01 00:06:00        360
7       1970-01-01 00:07:00        420
8       1970-01-01 00:08:00        480
9       1970-01-01 00:09:00        540
...                     ...        ...
9999990 1989-01-05 10:30:00  599999400
9999991 1989-01-05 10:31:00  599999460
9999992 1989-01-05 10:32:00  599999520
9999993 1989-01-05 10:33:00  599999580
9999994 1989-01-05 10:34:00  599999640
9999995 1989-01-05 10:35:00  599999700
9999996 1989-01-05 10:36:00  599999760
9999997 1989-01-05 10:37:00  599999820
9999998 1989-01-05 10:38:00  599999880
9999999 1989-01-05 10:39:00  599999940

[10000000 rows x 2 columns]

Check:

In [31]: pd.to_datetime(df.unix_ts, unit='s')
Out[31]:
0         1970-01-01 00:00:00
1         1970-01-01 00:01:00
2         1970-01-01 00:02:00
3         1970-01-01 00:03:00
4         1970-01-01 00:04:00
5         1970-01-01 00:05:00
6         1970-01-01 00:06:00
7         1970-01-01 00:07:00
8         1970-01-01 00:08:00
9         1970-01-01 00:09:00
                  ...
9999990   1989-01-05 10:30:00
9999991   1989-01-05 10:31:00
9999992   1989-01-05 10:32:00
9999993   1989-01-05 10:33:00
9999994   1989-01-05 10:34:00
9999995   1989-01-05 10:35:00
9999996   1989-01-05 10:36:00
9999997   1989-01-05 10:37:00
9999998   1989-01-05 10:38:00
9999999   1989-01-05 10:39:00
Name: unix_ts, Length: 10000000, dtype: datetime64[ns]

Timing:

In [32]: %timeit pd.to_datetime(df.unix_ts, unit='s')
10 loops, best of 3: 156 ms per loop

Conclusion: I think 156 milliseconds for converting 10.000.000 rows is not that slow

MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
  • Maybe directly using the timestamps is a better idea indeed. However I'm working with tens of millions of lines, and the `datetime` stuff is notoriously slow. – Khris Aug 01 '17 at 11:57
  • @Khris, yes, using this method we can convert it precisely – MaxU - stand with Ukraine Aug 01 '17 at 11:59
  • @Khris, i've added timing - please check – MaxU - stand with Ukraine Aug 01 '17 at 12:06
  • Initially I'm doing a conversion at one point and then save all values to avoid having to do later conversions. But I didn't anticipate I would need to convert back like that, so it's probably the best to directly go from the epoch timestamps. – Khris Aug 01 '17 at 12:06
  • @MaxU - Can you compare week solutions? Because question is about converting with `week` and `year`. – jezrael Aug 01 '17 at 12:08
  • @jezrael, please read OP's comment to their own question - `Initially I have timestamps in __s__` – MaxU - stand with Ukraine Aug 01 '17 at 12:10
  • Your solution is indeed fast: `165 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)` – Khris Aug 01 '17 at 12:10
  • Interesting thing, if I first do `df["Test"] = pd.to_datetime(df.unix_ts, unit='s')` and then time this `%timeit df.Test.dt.date` it gets really slow: `18.9 s ± 370 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)`. Turns out it seems to be the `dt`-access that's so slow. – Khris Aug 01 '17 at 12:14
  • @Khris, yeah, that's interesting, thanks for update! – MaxU - stand with Ukraine Aug 01 '17 at 12:37
13

Like @Gianmario Spacagna mentioned for datetimes higher like 2018 use %V with %G:

L1 = [43,44,51,2,5,12,52,53,1,2,5,52]
L2 = [2016,2016,2016,2017,2017,2017,2018,2018,2019,2019,2019,2019]
df = pd.DataFrame({"Week":L1,"Year":L2})


df['new'] = pd.to_datetime(df.Week.astype(str)+
                           df.Year.astype(str).add('-1') ,format='%V%G-%u')
print (df)
    Week  Year        new
0     43  2016 2016-10-24
1     44  2016 2016-10-31
2     51  2016 2016-12-19
3      2  2017 2017-01-09
4      5  2017 2017-01-30
5     12  2017 2017-03-20
6     52  2018 2018-12-24
7     53  2018 2018-12-31
8      1  2019 2018-12-31
9      2  2019 2019-01-07
10     5  2019 2019-01-28
11    52  2019 2019-12-23
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Weird, the docs imply that the first day is already defined by using `%W` or `%U`: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior – Khris Aug 01 '17 at 11:56
  • 3
    This approach only works until year 2018. It is not compliant with ISO-8601 because the 31st December 2018 is assigned to week 53 of 2018 instead of week 1 of 2019. Causing all of the subsequent datetimes to be shifted by 7 days. – Gianmario Spacagna Feb 25 '20 at 11:10
  • @GianmarioSpacagna - Can you explian more? – jezrael Feb 25 '20 at 11:11
  • 1
    Try with Year = 2019 and Week = 1 and it will return 2019-01-07 as date. The correct date should be 2018-12-31. Source: https://www.epochconverter.com/weeks/2019 – Gianmario Spacagna Feb 25 '20 at 11:14
  • 3
    I think the solution for ISO weeks is in using a different format string (%G-W%V-%u) as documented in https://stackoverflow.com/a/17087427/2919826 – Gianmario Spacagna Feb 25 '20 at 11:20
6

There is something fishy going on with weeks starting from 2019. The ISO-8601 standard assigns the 31st December 2018 to the week 1 of year 2019. The other approaches based on:

pd.to_datetime(df.Week.astype(str)+
                  df.Year.astype(str).add('-2') ,format='%W%Y-%w')

will give shifted results starting from 2019.

In order to be compliant with the ISO-8601 standard you would have to do the following:

import pandas as pd
import datetime

L1 = [52,53,1,2,5,52]
L2 = [2018,2018,2019,2019,2019,2019]
df = pd.DataFrame({"Week":L1,"Year":L2})
df['ISO'] = df['Year'].astype(str) + '-W' + df['Week'].astype(str) + '-1'
df['DT'] = df['ISO'].map(lambda x: datetime.datetime.strptime(x, "%G-W%V-%u"))
print(df)

It prints:

   Week  Year         ISO         DT
0    52  2018  2018-W52-1 2018-12-24
1    53  2018  2018-W53-1 2018-12-31
2     1  2019   2019-W1-1 2018-12-31
3     2  2019   2019-W2-1 2019-01-07
4     5  2019   2019-W5-1 2019-01-28
5    52  2019  2019-W52-1 2019-12-23

The week 53 of 2018 is ignored and mapped to the week 1 of 2019.

Please verify yourself on https://www.epochconverter.com/weeks/2019.

Gianmario Spacagna
  • 1,270
  • 14
  • 12
  • 1
    Good observation. Python 3.8 introduces a new method on datetime.date objects that solves the entire problem: https://docs.python.org/3/library/datetime.html#datetime.date.fromisocalendar – Khris Feb 26 '20 at 07:07
  • 1
    Thanks @Khris. Unfortunately, most enterprise applications run on python 3.6 as most of the cloud providers currently support for managed environments but good to know for the future. – Gianmario Spacagna Feb 27 '20 at 08:15
1

If you want to follow ISO Week Date

Weeks start with Monday. Each week's year is the Gregorian year in which the Thursday falls. The first week of the year, hence, always contains 4 January. ISO week year numbering therefore slightly deviates from the Gregorian for some days close to 1 January.

The following sample code, generates a sequence of 60 Dates, starting from 18Dec2016 Sun and adds the appropriate columns.

It adds:

  • A "Date"
  • Week Day of the "Date"
  • Finds the Week Starting Monday of that "Date"
  • Finds the Year of the Week Starting Monday of that "Date"
  • Adds a Week Number (ISO)
  • Gets the Starting Monday Date, from Year and Week Number

Sample Code Below:

# Generate Some Dates
dft1 = pd.DataFrame(pd.date_range('2016-12-18', freq='D', periods=60))
dft1.columns = ['e_FullDate']
dft1['e_FullDateWeekDay'] = dft1.e_FullDate.dt.day_name().str.slice(0,3)


#Add a Week Start Date (Monday)
dft1['e_week_start'] = dft1['e_FullDate'] - pd.to_timedelta(dft1['e_FullDate'].dt.weekday,
                                                      unit='D')
dft1['e_week_startWeekDay'] = dft1.e_week_start.dt.day_name().str.slice(0,3)

#Add a Week Start Year
dft1['e_week_start_yr'] = dft1.e_week_start.dt.year

#Add a Week Number of Week Start Monday
dft1['e_week_no'] = dft1['e_week_start'].dt.week

#Add a Week Start generate from Week Number and Year
dft1['e_week_start_from_week_no'] = pd.to_datetime(dft1.e_week_no.astype(str)+
                  dft1.e_week_start_yr.astype(str).add('-1') ,format='%W%Y-%w')
dft1['e_week_start_from_week_noWeekDay'] = dft1.e_week_start_from_week_no.dt.day_name().str.slice(0,3)


with pd.option_context('display.max_rows', 999, 'display.max_columns', 0, 'display.max_colwidth', 9999):
    display(dft1)

enter image description here

ihightower
  • 3,093
  • 6
  • 34
  • 49