1

Currently I have a series of columns which I am creating that contain a boolean based on a date in the Dataframe I am using

df['bool1'] = [1 if x > pd.to_datetime('20190731') else 0 for x in df['date']]

df['bool2'] = [1 if x > pd.to_datetime('20190803') else 0 for x in df['date']]

df['bool3'] = [1 if x > pd.to_datetime('20190813') else 0 for x in df['date']]

I figured that a list comprehension like this is a pythonic way of solving the problem. I feel like my code is very clear in what it is doing, and somebody could easily follow it.

There is a potential improvement to be made in say creating a dictionary for {bool1:'20190731'} then loop through Key:Value pairs so that I don't repeat the line of code. But this will only decrease line number whilst increase readability and scalability. It won't actually made my code run faster.

However my problem is that this code is actually very slow to run. Should I be using a lambda function to speed this up? What is the fastest way to write this code?

sanyassh
  • 8,100
  • 13
  • 36
  • 70
Violatic
  • 374
  • 2
  • 18

2 Answers2

2

I think dictionary for new columns with values for compare is nice idea.

d = {'bool1':'20190731', 'bool2':'20190803', 'bool3':'20190813'}

Then is possible create new columns in loop:

for k, v in d.items():
    df[k] = (df['date'] > pd.to_datetime(v)).astype(int)
    #alternative
    #df[k] = np.where(df['date'] > pd.to_datetime(v), 1, 0)

For improve performance use broadcasting in numpy:

rng = pd.date_range('20190731', periods=20)
df = pd.DataFrame({'date': rng})  

d = {'bool1':'20190731', 'bool2':'20190803', 'bool3':'20190813'}

#pandas 0.24+
mask = df['date'].to_numpy()[:, None] > pd.to_datetime(list(d.values())).to_numpy()
#pandas below
#mask = df['date'].values[:, None] > pd.to_datetime(list(d.values())).values
arr = np.where(mask, 1, 0)

df = df.join(pd.DataFrame(arr, columns=d.keys()))
print (df)
         date  bool1  bool2  bool3
0  2019-07-31      0      0      0
1  2019-08-01      1      0      0
2  2019-08-02      1      0      0
3  2019-08-03      1      0      0
4  2019-08-04      1      1      0
5  2019-08-05      1      1      0
6  2019-08-06      1      1      0
7  2019-08-07      1      1      0
8  2019-08-08      1      1      0
9  2019-08-09      1      1      0
10 2019-08-10      1      1      0
11 2019-08-11      1      1      0
12 2019-08-12      1      1      0
13 2019-08-13      1      1      0
14 2019-08-14      1      1      1
15 2019-08-15      1      1      1
16 2019-08-16      1      1      1
17 2019-08-17      1      1      1
18 2019-08-18      1      1      1
19 2019-08-19      1      1      1
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

with numpy.where it should be faster


df['bool1'] = np.where(df['date'] > pd.to_datetime('20190731'), 1, 0)
df['bool2'] = np.where(df['date'] > pd.to_datetime('20190803'), 1, 0)
df['bool3'] = np.where(df['date'] > pd.to_datetime('20190813'), 1, 0)
help-ukraine-now
  • 3,850
  • 4
  • 19
  • 36