I am processing a click-stream dataframe and extracting features for each user in the stream, to be used in a machine learning project.
The dataframe is something like this:
import numpy as np
import pandas as pd

data = pd.DataFrame({'id': ['A01', 'B01', 'A01', 'C01', 'A01', 'B01', 'A01'],
                     'event': ['search', 'search', 'buy', 'home', 'cancel', 'home', 'search'],
                     'date': ['2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04', '2018-01-04', '2018-01-06'],
                     'product': ['tablet', 'dvd', 'tablet', 'tablet', 'tablet', 'book', 'book'],
                     'price': [103, 2, 203, 103, 203, 21, 21]})
data['date'] = pd.to_datetime(data['date'])
Since I have to create features for each user, I'm using groupby/apply with a custom function:
featurized = data.groupby('id').apply(featurize)
The featurizing function takes a chunk of the dataframe (one user's events) and creates many (hundreds of) features. The whole process is just too slow, so I'm looking for a recommendation to do this more efficiently.
An example of the function used to create features:
def featurize(group):
    features = dict()
    # User id
    features['id'] = group['id'].max()
    # Feature 1: Number of search events
    features['number_of_search_events'] = (group['event'] == 'search').sum()
    # Feature 2: Number of tablets
    features['number_of_tablets'] = (group['product'] == 'tablet').sum()
    # Feature 3: Total time
    features['total_time'] = (group['date'].max() - group['date'].min()) / np.timedelta64(1, 'D')
    # Feature 4: Total number of events
    features['events'] = len(group)
    # Histogram of products examined
    product_counts = group['product'].value_counts()
    # Feature 5: Max events for a product
    features['max_product_events'] = product_counts.max()
    # Feature 6: Min events for a product
    features['min_product_events'] = product_counts.min()
    # Feature 7: Mean events for a product
    features['mean_product_events'] = product_counts.mean()
    # Feature 8: Std of events for a product
    features['std_product_events'] = product_counts.std()
    # Feature 9: Total price for tablet products
    features['tablet_price_sum'] = group.loc[group['product'] == 'tablet', 'price'].sum()
    # Feature 10: Max price for tablet products
    features['tablet_price_max'] = group.loc[group['product'] == 'tablet', 'price'].max()
    # Feature 11: Min price for tablet products
    features['tablet_price_min'] = group.loc[group['product'] == 'tablet', 'price'].min()
    # Feature 12: Mean price for tablet products
    features['tablet_price_mean'] = group.loc[group['product'] == 'tablet', 'price'].mean()
    # Feature 13: Std of price for tablet products
    features['tablet_price_std'] = group.loc[group['product'] == 'tablet', 'price'].std()
    return pd.Series(features)
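For reference, the function can also be run on a single user's chunk to inspect its output (a small usage sketch on the toy data above; the full run is the groupby/apply call shown earlier):

# Inspect the features produced for a single user, e.g. 'A01' from the toy data above.
single_user_features = featurize(data[data['id'] == 'A01'])
print(single_user_features)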
One potential problem is that each feature scans the whole chunk, so with 100 features I scan the chunk 100 times instead of just once.
For example, one feature can be the number of "tablet" events the user has, another the number of "home" events, another the average time difference between "search" events, then the average time difference between "search" events for "tablets", and so on. Each feature can be coded as a function that takes a chunk (df) and produces the feature (see the sketch below), but when there are hundreds of features each one scans the whole chunk when a single linear scan would suffice. The problem is that the code would get ugly if I did a manual for loop over each record in the chunk and coded all the features inside that loop.
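To make that concrete, here is a rough sketch of the per-feature-function pattern I am describing (illustrative names only, not my production code); every helper receives the same group and scans or filters it again from scratch:

def number_of_tablet_events(group):
    return (group['product'] == 'tablet').sum()

def number_of_home_events(group):
    return (group['event'] == 'home').sum()

def mean_days_between_search_events(group):
    # Time difference (in days) between consecutive 'search' events
    search_dates = group.loc[group['event'] == 'search', 'date'].sort_values()
    return search_dates.diff().dt.days.mean()

def featurize_from_parts(group):
    # Hypothetical variant of featurize built from per-feature helpers:
    # each helper re-scans the group, so hundreds of features mean hundreds of passes.
    return pd.Series({
        'number_of_tablet_events': number_of_tablet_events(group),
        'number_of_home_events': number_of_home_events(group),
        'mean_days_between_search_events': mean_days_between_search_events(group),
        # ... hundreds more features, each one scanning `group` again
    })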
Questions:
If I have to process a chunk hundreds of times (once per feature), is there a way to abstract this into a single scan that creates all the needed features?
Is there a speed improvement over the groupby/apply approach I'm currently using?