I'm trying to set a boolean column based on a sequence of conditions in a dataframe. I'm testing multiple conditions, and where I experience the most problems are comparing two list objects. I know how to test whether or not all items in the smaller list are all in a larger list, and I know how to set a boolean column based on multiple conditions/columns across a dataframe. What I can't seem to do is get them to play nice together--when I try to iterate through the lists, they end up iterating through the column rows, rather than the objects in the list in each row. I've already written a for loop to assign values per-row, but it is incredibly slow and I'll have a minimum of 1 million rows to iterate over, so I need speed.
Here's a solution I've tested. In this scenario, 'success' is attributed to event1 when...
- The sequence of events are done by the same user.
- Event 1 is immediately followed by event 3.
- The tag list of event 1 is not empty.
- Every tag in the list of event 1 tags is with a list of tags in event 3.
I'll be expanding conditions to other values not listed in this problem, but bonus points awarded if your answer includes conditional logic checking for a single value in a list of values [or it could be a sequence of columns], and/or checking that all values in a list/sequence are not None/np.NaN.
df['success'] = np.where((
(df.user_id==df.user_id.shift(-1)) &
(df.event_id==1) &
(df.event_id.shift(-1)==3) &
(len(df.event1_tags)>0) & # breaks because it's counting the rows in pd.Series
(all(e in df.event3_tags.shift(-1) for e in df.event1_tags)) # breaks because it iterates through both columns as Series
), 1, 0)
Here are two stackoverflow articles that have helped me, and a toy dataframe, followed by the desired output with the toy dataframe.
Pandas: How do I assign values based on multiple conditions for existing columns?
Checking if List contains all items from another list
data = {'user_id' : [1, 1, 1, 2, 2, 2, 3, 3, 3],
'event_id' : [1, 1, 3, 1, 3, 3, 1, 3, 3],
'event1_tags' : [['tag1'], [], np.NaN, ['tag2', 'tag3'], np.NaN, np.NaN, ['tag2', 'tag4'], np.NaN, np.NaN],
'event3_tags' : [np.NaN, np.NaN, ['tag1', 'tag2', 'tag3'],
np.NaN, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'],
np.NaN, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']]}
df = pd.DataFrame(data)
df
user_id event_id event1_tags event3_tags
0 1 1 [tag1] NaN
1 1 1 [] NaN
2 1 3 NaN [tag1, tag2, tag3]
3 2 1 [tag2, tag3] NaN
4 2 3 NaN [tag1, tag2, tag3]
5 2 3 NaN [tag1, tag2, tag3]
6 3 1 [tag2, tag4] NaN
7 3 3 NaN [tag1, tag2, tag3]
8 3 3 NaN [tag1, tag2, tag3]
data = {'user_id' : [1, 1, 1, 2, 2, 2, 3, 3, 3],
'event_id' : [1, 1, 3, 1, 3, 3, 1, 3, 3],
'event1_tags' : [['tag1'], [], np.NaN, ['tag2', 'tag3'], np.NaN, np.NaN, ['tag2', 'tag4'], np.NaN, np.NaN],
'event3_tags' : [np.NaN, np.NaN, ['tag1', 'tag2', 'tag3'],
np.NaN, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'],
np.NaN, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']],
'success' : [0, 0, 0, 1, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
df
user_id event_id event1_tags event3_tags success
0 1 1 [tag1] NaN 0
1 1 1 [] NaN 0
2 1 3 NaN [tag1, tag2, tag3] 0
3 2 1 [tag2, tag3] NaN 1
4 2 3 NaN [tag1, tag2, tag3] 0
5 2 3 NaN [tag1, tag2, tag3] 0
6 3 1 [tag2, tag4] NaN 0
7 3 3 NaN [tag1, tag2, tag3] 0
8 3 3 NaN [tag1, tag2, tag3] 0