
I have a huge time-series DataFrame (about 100,000,000 rows) and I need to filter its rows by conditions stored in another DataFrame. That second DataFrame has about 2,000 rows, each one describing a condition. Toy example:

df = pd.DataFrame({'val': [1, 3, 2, 4, 3, 1, 2, 3],
                   'date': ['2015-03-12', '2015-04-12', '2015-05-13', '2016-03-12',
                            '2016-04-07', '2016-05-12', '2017-01-11', '2017-03-20']})
df_condition = pd.DataFrame({'val': [2, 3], 'date': ['2015-07-13', '2016-04-08']})

The condition is: remove every row in df whose val appears earlier than the corresponding date in df_condition:

df = df[~((df['val']==2) & (df['date']<'2015-07-13'))]
df = df[~((df['val']==3) & (df['date']<'2016-04-08'))]

and so on, for about 2,000 conditions.

I use the approach below, but it takes too long (about 5 hours). Is there a faster method?

vals = df_condition.val.values
dates = df_condition.date.values
for i in range(len(df_condition)):
    df = df[~((df.val==vals[i])&(df.date < dates[i]))]
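(At this scale — 2,000 conditions against 100M rows — a single vectorized comparison may beat any per-condition loop. A sketch, assuming each val occurs at most once in df_condition; the datetime parsing is mine, ISO-format strings would compare the same way:)

```python
import pandas as pd

# Toy data from the question
df = pd.DataFrame({'val': [1, 3, 2, 4, 3, 1, 2, 3],
                   'date': pd.to_datetime(['2015-03-12', '2015-04-12', '2015-05-13',
                                           '2016-03-12', '2016-04-07', '2016-05-12',
                                           '2017-01-11', '2017-03-20'])})
df_condition = pd.DataFrame({'val': [2, 3],
                             'date': pd.to_datetime(['2015-07-13', '2016-04-08'])})

# Map each val to its cutoff date; vals with no condition get NaT
cutoff = df['val'].map(df_condition.set_index('val')['date'])

# Keep rows that have no condition, or whose date is on/after the cutoff
out = df[cutoff.isna() | (df['date'] >= cutoff)]
```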

1 Answer


I believe you can create a list of keep-masks (one per condition) and then reduce them with logical AND:

masks = [(df.val!=x.val) | (df.date>=x.date) for x in df_condition.itertuples()]
print (masks)
[0     True
1     True
2    False
3     True
4     True
5     True
6     True
7     True
dtype: bool, 0     True
1    False
2     True
3     True
4    False
5     True
6     True
7     True
dtype: bool]

df1 = df[np.logical_and.reduce(masks)]
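(For completeness, a runnable version on the toy data; the datetime conversion is my addition, string dates in ISO format would compare identically:)

```python
import numpy as np
import pandas as pd

# Toy data from the question
df = pd.DataFrame({'val': [1, 3, 2, 4, 3, 1, 2, 3],
                   'date': pd.to_datetime(['2015-03-12', '2015-04-12', '2015-05-13',
                                           '2016-03-12', '2016-04-07', '2016-05-12',
                                           '2017-01-11', '2017-03-20'])})
df_condition = pd.DataFrame({'val': [2, 3],
                             'date': pd.to_datetime(['2015-07-13', '2016-04-08'])})

# One keep-mask per condition: a row survives a condition if its val
# differs, or its date is on/after that condition's cutoff
masks = [(df.val != x.val) | (df.date >= x.date)
         for x in df_condition.itertuples()]

# A row is kept only if it survives every condition
df1 = df[np.logical_and.reduce(masks)]
```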
jezrael