0

I'm trying to set a boolean column based on a sequence of conditions in a dataframe. I'm testing multiple conditions, and where I experience the most problems are comparing two list objects. I know how to test whether or not all items in the smaller list are all in a larger list, and I know how to set a boolean column based on multiple conditions/columns across a dataframe. What I can't seem to do is get them to play nice together--when I try to iterate through the lists, they end up iterating through the column rows, rather than the objects in the list in each row. I've already written a for loop to assign values per-row, but it is incredibly slow and I'll have a minimum of 1 million rows to iterate over, so I need speed.

Here's a solution I've tested. In this scenario, 'success' is attributed to event1 when...

  1. The sequence of events are done by the same user.
  2. Event 1 is immediately followed by event 3.
  3. The tag list of event 1 is not empty.
  4. Every tag in the list of event 1 tags is with a list of tags in event 3.

I'll be expanding conditions to other values not listed in this problem, but bonus points awarded if your answer includes conditional logic checking for a single value in a list of values [or it could be a sequence of columns], and/or checking that all values in a list/sequence are not None/np.NaN.

df['success'] = np.where((
    (df.user_id==df.user_id.shift(-1)) & 
    (df.event_id==1) & 
    (df.event_id.shift(-1)==3) &
    (len(df.event1_tags)>0) & # breaks because it's counting the rows in pd.Series
    (all(e in df.event3_tags.shift(-1) for e in df.event1_tags)) # breaks because it iterates through both columns as Series
                         ), 1, 0)

Here are two stackoverflow articles that have helped me, and a toy dataframe, followed by the desired output with the toy dataframe.

Pandas: How do I assign values based on multiple conditions for existing columns?

Checking if List contains all items from another list

data = {'user_id' : [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'event_id' : [1, 1, 3, 1, 3, 3, 1, 3, 3],
        'event1_tags' : [['tag1'], [], np.NaN, ['tag2', 'tag3'], np.NaN, np.NaN, ['tag2', 'tag4'], np.NaN, np.NaN],
        'event3_tags' : [np.NaN, np.NaN, ['tag1', 'tag2', 'tag3'], 
                         np.NaN, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'], 
                         np.NaN, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']]}
df = pd.DataFrame(data)
df

    user_id event_id    event1_tags     event3_tags
0   1       1           [tag1]          NaN
1   1       1           []              NaN
2   1       3           NaN             [tag1, tag2, tag3]
3   2       1           [tag2, tag3]    NaN
4   2       3           NaN             [tag1, tag2, tag3]
5   2       3           NaN             [tag1, tag2, tag3]
6   3       1           [tag2, tag4]    NaN
7   3       3           NaN             [tag1, tag2, tag3]
8   3       3           NaN             [tag1, tag2, tag3]
data = {'user_id' : [1, 1, 1, 2, 2, 2, 3, 3, 3],
        'event_id' : [1, 1, 3, 1, 3, 3, 1, 3, 3],
        'event1_tags' : [['tag1'], [], np.NaN, ['tag2', 'tag3'], np.NaN, np.NaN, ['tag2', 'tag4'], np.NaN, np.NaN],
        'event3_tags' : [np.NaN, np.NaN, ['tag1', 'tag2', 'tag3'], 
                         np.NaN, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3'], 
                         np.NaN, ['tag1', 'tag2', 'tag3'], ['tag1', 'tag2', 'tag3']],
        'success' : [0, 0, 0, 1, 0, 0, 0, 0, 0]}
df = pd.DataFrame(data)
df

    user_id event_id    event1_tags     event3_tags         success
0   1       1           [tag1]          NaN                 0
1   1       1           []              NaN                 0
2   1       3           NaN             [tag1, tag2, tag3]  0
3   2       1           [tag2, tag3]    NaN                 1
4   2       3           NaN             [tag1, tag2, tag3]  0
5   2       3           NaN             [tag1, tag2, tag3]  0
6   3       1           [tag2, tag4]    NaN                 0
7   3       3           NaN             [tag1, tag2, tag3]  0
8   3       3           NaN             [tag1, tag2, tag3]  0

1 Answers1

0

This is my current solution. As it turns out, it's not as slow as I expected it to be (but I do know it's slow). I'm still interested in faster solutions if anybody has tips.

def get_conversion(df):
    event_dataframe['success'] = 0
    for i in df.itertuples():
        current_idx = i[0]
        next_idx = i[0]+1
        if ((next_idx in df.index)
            and (df['user_id'][current_idx]==df['user_id'][next_idx]) 
            and (df['event_id'][current_idx]==1) 
            and (df['event_id'][next_idx]==3) 
            and (len(df['event1_tags'][current_idx])!=0) 
            and (all(t in df['event3_tags'][next_idx] for t in df['event1_tags'][current_idx]))
             ):
            df.loc[current_idx, ['success']] = 1
        else:
            pass
    return df

df = get_conversion(df)