1

Im adding week dates to big dataframes baset on day dates. It takes up to hours when working with 5 mln rows. Here is my code:

import datetime
import pandas as pd

pd.DataFrame(data = ['2022-05-01','2021-05-01','2022-02-15'])
df,col,interval_shift,pivot_index,kwargs = pd.DataFrame(data = ['2022-05-01','2021-05-01','2022-02-15']),0,0,'Week start date',{}
date_col_format = kwargs['date_col_format'] if 'date_col_format' in kwargs else '%Y-%m-%d'

print('Week Number starts',datetime.datetime.now())
df['Week Number'] = df[col].apply(lambda x: ((datetime.datetime.strptime(x, date_col_format) - (pd.DateOffset(days=interval_shift))).strftime('%Y') + '-'+(datetime.datetime.strptime(x, date_col_format)- (pd.DateOffset(days=interval_shift))).strftime('%V')))
print('Week Number ready',datetime.datetime.now())
df[pivot_index] = df['Week Number'].apply(lambda y: (pd.to_datetime(y + '-1', format='%G-%V-%u') + pd.DateOffset(days=interval_shift)).strftime('%Y-%m-%d'))
print('Week pivot_index ready',datetime.datetime.now())
df

I need both cols: week number and date of weeks mondays.

output

0 Week Number Week start date
0 2022-05-01 2022-17 2022-04-25
1 2021-05-01 2021-17 2021-04-26
2 2022-02-15 2022-07 2022-02-14

explanation for interval_shift is nececity to shift week forward and backward and get number and start date of shifted week (from friday to thurstay for example)

Peregru
  • 11
  • 2
  • 1
    you probably want to use [pd.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) exclusively, and also try to avoid `apply` (element-wise iteration) – FObersteiner Jul 14 '22 at 07:58
  • 1
    There also seems to be unnecessary conversion to datetime, then to string, then back to datetime - however, without exemplary data (a [mre]), it is basically just guessing. – FObersteiner Jul 14 '22 at 10:01
  • FObersteiner, thanx to pointing at a minimal reproducible example. Added it to my question. Cant really find out how to avoid lambda and confusing in python and pandas date formats. looking forrvard for any help – Peregru Jul 14 '22 at 12:44
  • @Peregru do you mind to check my answer? – rpanai Jul 14 '22 at 13:31
  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Jul 22 '22 at 05:58

1 Answers1

1

There are two questions that can help you

import pandas as pd
# data
df = pd.DataFrame(
    data=['2022-05-01','2021-05-01','2022-02-15'],
    columns=["date"])
df["date"] = df["date"].astype("M8")

# Week number
df["Week number"] = df['date'].dt.strftime('%Y-%U')

# Week start date
df["Week start date"] = (
    df["date"] - 
    df['date'].dt.weekday.astype('timedelta64[D]'))

Test Performances

generate a 5M row df

import pandas as pd
# this has 50_000 rows
df = pd.DataFrame(
{"date": pd.date_range("1900-01-01", periods=50_000)})
# so we duplicate 100 times
df = pd.concat([df for i in range(100)], ignore_index=True)
# we save a copy for further tests
df_bk = df.copy()

Timing initial code

%%time
df["Week number"] = df['date'].dt.strftime('%Y-%U')
df["Week start date"] = (
    df["date"] - 
    df['date'].dt.weekday.astype('timedelta64[D]'))
CPU times: user 16 s, sys: 228 ms, total: 16.2 s
Wall time: 16.2 s

Timing using other approach

Here the most likely thing is that we have duplicated dates. So why don't just use our snippet on unique dates and then merge back to the original df?

# data 
df1 = df_bk.copy()
%%time
# extract unique dates
df = pd.DataFrame({"date": df1["date"].unique()})
# We use same snippet as above
df["Week number"] = df['date'].dt.strftime('%Y-%U')
df["Week start date"] = (
    df["date"] - 
    df['date'].dt.weekday.astype('timedelta64[D]'))
# We merge to the result
df = pd.merge(df1, df, on="date")
CPU times: user 685 ms, sys: 59.4 ms, total: 744 ms
Wall time: 744 ms

Which is a 20x speedup compared to the previous code.

rpanai
  • 12,515
  • 2
  • 42
  • 64
  • Hi, rpanai. Thanx! То calculate week number and week date separately and then join is a great idea. Ill try it as soon as i can. Ill also stuty m8 datatype and look for how can it be helpfull. How do you think can i somehow add a possibility to shift week dates to mark weeks not from monday to sundau or from sunday to saturday but -3 days for example from friday to thursday? And do it pandas way, not with slow lambda? – Peregru Jul 14 '22 at 16:08