Problem
Let's say we have a dataframe that looks like this:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
...
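For concreteness, here is a minimal reconstruction of this example dataframe in pandas; the code sketches further down assume it is available as df:

```python
import pandas as pd

# A minimal reconstruction of the example dataframe above (only the two rows shown).
df = pd.DataFrame({
    'age': [23, 35],
    'job': ['engineer', 'manager'],
    'friends': [['World of Warcraft', 'Netflix', '9gag'], None],
    'label': [1, 0],
})
```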
If we are interested in training a classifier that predicts label using age, job, and friends as features, how would we go about transforming the features into a numerical array which can be fed into a model?
- Age is pretty straightforward since it is already numerical.
- Job can be hashed / indexed since it is a categorical variable.
- Friends is a list of categorical variables. How would I go about representing this feature?
Approaches:
Hash each element of the list. Using the example dataframe, let's assume our hashing function has the following mapping:
NULL -> 0
engineer -> 42069
World of Warcraft -> 9001
Netflix -> 14
9gag -> 9
manager -> 250
Let's further assume that the maximum length of friends is 5. Anything shorter gets zero-padded on the right-hand side; if the friends list is longer than 5, only the first 5 elements are kept.
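A sketch of these rules (the dict HASHES simply reproduces the example mapping above; in practice a real hashing function or a fitted vocabulary index would be used, and the helper names are only illustrative):

```python
# Example mapping and padding/truncation rule described above.
HASHES = {
    None: 0,
    'engineer': 42069,
    'World of Warcraft': 9001,
    'Netflix': 14,
    '9gag': 9,
    'manager': 250,
}
MAX_FRIENDS = 5

def hash_friends(friends):
    # Hash each friend, truncate to MAX_FRIENDS, then zero-pad on the right.
    hashes = [HASHES[f] for f in (friends or [])][:MAX_FRIENDS]
    return hashes + [0] * (MAX_FRIENDS - len(hashes))
```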
Approach 1: Hash and Stack
The dataframe after the feature transformation would look like this:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
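A rough sketch of Approach 1, reusing the helpers above (the function name is just illustrative):

```python
def approach1_features(row):
    # Approach 1: age, hashed job, then the padded friend hashes in list order.
    return [row['age'], HASHES[row['job']], *hash_friends(row['friends'])]

# approach1_features(df.iloc[0])  ->  [23, 42069, 9001, 14, 9, 0, 0]
# approach1_features(df.iloc[1])  ->  [35, 250, 0, 0, 0, 0, 0]
```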
Limitations
Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
...
Compare the features of the first and third records:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 14, 9, 9001, 0, 0] 1
Both records have the same set of friends, but because the lists are ordered differently the resulting feature vectors differ, even though they should be identical.
Approach 2: Hash, Order, and Stack
To solve the limitation of Approach 1, simply order the hashes from the friends feature. This would result in the following feature transform (assuming descending order):
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 9001, 14, 9, 0, 0] 1
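The only change from Approach 1 is sorting the friend hashes before padding; a sketch, assuming the same helpers as above:

```python
def approach2_features(row):
    # Approach 2: like Approach 1, but sort the friend hashes (descending)
    # before truncating and padding, so the order of the list no longer matters.
    hashes = sorted((HASHES[f] for f in (row['friends'] or [])), reverse=True)
    hashes = hashes[:MAX_FRIENDS]
    hashes += [0] * (MAX_FRIENDS - len(hashes))
    return [row['age'], HASHES[row['job']], *hashes]
```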
This approach has a limitation too. Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
42 'manager' ['Netflix', '9gag'] 1
...
Applying the feature transform with ordering, we get:
row feature label
1 [23, 42069, 9001, 14, 9, 0, 0] 1
2 [35, 250, 0, 0, 0, 0, 0] 0
3 [26, 42069, 9001, 14, 9, 0, 0] 1
4 [42, 250, 14, 9, 0, 0, 0] 1
What is the problem with the above features? Well, the hashes for Netflix and 9gag sit at the same indices in rows 1 and 3, but at different indices in row 4. This positional inconsistency would mess up the training.
Approach 3: Convert Array to Columns
What if we convert friends into a set of 5 columns and deal with each of the resulting columns just like we deal with any categorical variable?
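A sketch of what that expansion could look like with pandas, reusing the hash_friends helper from above (the column names are arbitrary):

```python
# Approach 3 sketch: expand the padded friend hashes into 5 separate columns,
# each of which would then be treated as an ordinary categorical feature.
friend_cols = pd.DataFrame(
    df['friends'].apply(hash_friends).tolist(),
    columns=[f'friend_{i}' for i in range(1, 6)],
    index=df.index,
)
df_expanded = df.drop(columns='friends').join(friend_cols)
```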
Well, let's assume the friends vocabulary size is large (>100k). It would then be madness to go and create >100k columns where each column is responsible for the hash of the respective vocab element.
Approach 4: One-Hot-Encoding and then Sum
How about this? Convert each hash to a one-hot vector, and add up all these vectors.
In this case, the feature in row one, for example, would look like this:
[23, 42069, 0×8, 1, 0×4, 1, 0×8986, 1, 0×(max_hash_size − 9001)]
where 0×8 denotes a run of 8 zeros.
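A sketch of this multi-hot construction with NumPy, reusing the helpers above (MAX_HASH_SIZE is a placeholder for the size of the hash space):

```python
import numpy as np

MAX_HASH_SIZE = 100_000  # placeholder size of the hash space

def approach4_features(row):
    # Approach 4: sum the one-hot vectors of the friend hashes into one
    # multi-hot vector, then prepend age and the job hash.
    multi_hot = np.zeros(MAX_HASH_SIZE)
    for friend in (row['friends'] or []):
        multi_hot[HASHES[friend]] += 1.0
    return np.concatenate(([row['age'], HASHES[row['job']]], multi_hot))
```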
The problem with this approach is that these vectors will be very large and sparse.
Approach 5: Use Embedding Layer and 1D-Conv
With this approach, we feed each word in the friends array to an embedding layer, then convolve, similar to the Keras IMDB CNN example: https://keras.io/examples/imdb_cnn/
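For reference, a rough Keras sketch of the friends branch of such a model (vocab_size, embedding_dim, and the layer sizes are placeholder values; age and job would still need to be fed in through a second input and concatenated):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embedding_dim, max_friends = 100_000, 32, 5

# Friends branch only: embed the hashed friend IDs, convolve, pool, classify.
friends_input = keras.Input(shape=(max_friends,), dtype='int32')
x = layers.Embedding(vocab_size, embedding_dim)(friends_input)
x = layers.Conv1D(64, kernel_size=3, padding='same', activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)
output = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model(friends_input, output)
model.compile(optimizer='adam', loss='binary_crossentropy')
```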
Limitation: this requires a deep learning framework. I want something that works with traditional machine learning, e.g. logistic regression or a decision tree.
What are your thoughts on this?