
Problem

Let's say we have a dataframe that looks like this:

age  job         friends                                    label
23   'engineer'  ['World of Warcraft', 'Netflix', '9gag']   1
35   'manager'   NULL                                       0
...

If we are interested in training a classifier that predicts label using age, job, and friends as features, how would we go about transforming the features into a numerical array which can be fed into a model?

  • Age is pretty straightforward since it is already numerical.
  • Job can be hashed / indexed since it is a categorical variable.
  • Friends is a list of categorical variables. How would I go about representing this feature?

Approaches:

Hash each element of the list. Using the example dataframe, let's assume our hashing function has the following mapping:

NULL                -> 0
engineer            -> 42069
World of Warcraft   -> 9001
Netflix             -> 14
9gag                -> 9
manager             -> 250
 

Let's further assume that the maximum length of friends is 5. Anything shorter gets zero-padded on the right-hand side; if the friends list is longer than 5, only the first 5 elements are kept.

Approach 1: Hash and Stack

The dataframe after feature transformation would look like this:

feature                             label
[23, 42069, 9001, 14, 9, 0, 0]      1
[35, 250,   0,    0,  0, 0, 0]      0
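For concreteness, a minimal sketch of this transform in plain Python, using the toy hash mapping assumed above (a real pipeline would use an actual hashing function rather than a hand-written lookup):

HASHES = {
    None: 0,
    'engineer': 42069,
    'World of Warcraft': 9001,
    'Netflix': 14,
    '9gag': 9,
    'manager': 250,
}
MAX_FRIENDS = 5  # shorter lists are right-padded with zeros, longer ones truncated

def hash_and_stack(age, job, friends):
    friends = friends or []                       # treat NULL as an empty list
    hashes = [HASHES[f] for f in friends[:MAX_FRIENDS]]
    hashes += [0] * (MAX_FRIENDS - len(hashes))   # zero-pad on the right
    return [age, HASHES[job]] + hashes

print(hash_and_stack(23, 'engineer', ['World of Warcraft', 'Netflix', '9gag']))
# [23, 42069, 9001, 14, 9, 0, 0]
print(hash_and_stack(35, 'manager', None))
# [35, 250, 0, 0, 0, 0, 0]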

Limitations

Consider the following:

age  job           friends                                        label
23   'engineer'    ['World of Warcraft', 'Netflix', '9gag']       1
35   'manager'      NULL                                          0
26   'engineer'    ['Netflix', '9gag', 'World of Warcraft']       1
...

Compare the features of the first and third record:

feature                             label
[23, 42069, 9001, 14, 9, 0, 0]      1
[35, 250,   0,    0,  0, 0, 0]      0
[26, 42069, 14,   9,    9001, 0, 0]  1

Both records have the same set of friends, but the friends are listed in a different order, resulting in different feature vectors even though they should be identical.

Approach 2: Hash, Order, and Stack

To solve the limitation of Approach 1, simply order the hashes from the friends feature. This would result in the following feature transform (assuming descending order):

feature                             label
[23, 42069, 9001, 14, 9, 0, 0]      1
[35, 250,   0,    0,  0, 0, 0]      0
[26, 42069, 9001, 14, 9, 0, 0]      1
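The only change relative to the previous sketch is sorting the friend hashes in descending order before padding, reusing the toy HASHES mapping and MAX_FRIENDS from above:

def hash_order_and_stack(age, job, friends):
    friends = friends or []
    hashes = sorted((HASHES[f] for f in friends[:MAX_FRIENDS]), reverse=True)
    hashes += [0] * (MAX_FRIENDS - len(hashes))   # zero-pad on the right
    return [age, HASHES[job]] + hashes

print(hash_order_and_stack(26, 'engineer', ['Netflix', '9gag', 'World of Warcraft']))
# [26, 42069, 9001, 14, 9, 0, 0]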

This approach has a limitation too. Consider the following:

age  job           friends                                        label
23   'engineer'    ['World of Warcraft', 'Netflix', '9gag']       1
35   'manager'      NULL                                          0
26   'engineer'    ['Netflix', '9gag', 'World of Warcraft']       1
42   'manager'     ['Netflix', '9gag']                            1
...

Applying feature transform with ordering we get:

row  feature                             label
1    [23, 42069, 9001, 14, 9, 0, 0]      1
2    [35, 250,   0,    0,  0, 0, 0]      0
3    [26, 42069, 9001, 14, 9, 0, 0]      1
4    [42, 250,   14,   9, 0, 0, 0]      1

What is the problem with the above features? The hashes for Netflix and 9gag occupy the same array positions in rows 1 and 3, but different positions in row 4. This would interfere with training.

Approach 3: Convert Array to Columns

What if we convert friends into a set of 5 columns and deal with each of the resulting columns just like we deal with any categorical variable?

Well, let's assume the friends vocabulary is large (>100k entries). It would then be madness to create >100k columns, where each column corresponds to the hash of one vocabulary element.

Approach 4: One-Hot-Encoding and then Sum

How about this: convert each hash into a one-hot vector, and add up all these vectors.

In this case, the feature vector for row 1, for example, would look like this:

[23, 42069, 0*8, 1, 0*4, 1, 0*8986, 1, 0*(max_hash_size - 9001)]

where 0*8 denotes a run of 8 zeros; the 1s sit at positions 9, 14, and 9001 of the summed one-hot block.

The problem with this approach is that these vectors will be huge and sparse.
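One way to keep this tractable is to store the summed one-hot block in a sparse format. A minimal sketch using scikit-learn's MultiLabelBinarizer with sparse output (the rows are the example records from the question):

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import MultiLabelBinarizer

ages = np.array([[23], [35], [26]])
friends = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],                                        # NULL becomes an empty list
    ['Netflix', '9gag', 'World of Warcraft'],
]

# Multi-hot encode the friends lists as a sparse matrix; only non-zero entries are stored.
mlb = MultiLabelBinarizer(sparse_output=True)
friends_multi_hot = mlb.fit_transform(friends)

# Stack the age column next to the sparse friends block.
X = hstack([csr_matrix(ages), friends_multi_hot]).tocsr()
print(X.shape, X.nnz)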

Approach 5: Use Embedding Layer and 1D-Conv

With this approach, we feed each word in the friends array to the embedding layer, then convolve. Similar to the Keras IMDB example: https://keras.io/examples/imdb_cnn/
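For reference, a rough sketch of that architecture in Keras; the layer sizes are placeholders, and the friend hashes would first need to be re-indexed into a contiguous vocabulary:

from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 100_000    # assumed size of the friends vocabulary after re-indexing
max_friends = 5         # padded/truncated friends length, as above

friends_in = keras.Input(shape=(max_friends,), dtype='int32')
x = layers.Embedding(input_dim=vocab_size, output_dim=32)(friends_in)
x = layers.Conv1D(filters=32, kernel_size=3, padding='same', activation='relu')(x)
x = layers.GlobalMaxPooling1D()(x)

age_job_in = keras.Input(shape=(2,))            # age and the indexed job
x = layers.concatenate([x, age_job_in])
out = layers.Dense(1, activation='sigmoid')(x)

model = keras.Model([friends_in, age_job_in], out)
model.compile(optimizer='adam', loss='binary_crossentropy')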

Limitation: this requires a deep learning framework. I want something that works with traditional machine learning, such as logistic regression or a decision tree.

What are your thoughts on this?

  • I am currently investigating sparse data representations, and I think they might be useful for dealing with Approach 4's limitation – tooskoolforkool Jun 19 '20 at 13:17
  • It is possible, in a couple of lines of Python, to use a pre-trained word embedding to convert your list of words to a list of vectors. This opens up further ways of preparing the data for use in a traditional ML model, but does it fit your criterion 'works with traditional machine learning'? No need to train a neural net, but you do need to import (e.g.) tensorflow to do the embedding. I can flesh this out to a proper answer if you're interested. – David Harris Aug 16 '22 at 09:54

4 Answers


As another answer mentioned, you've already listed a number of alternatives that could work, depending on the dataset and the model and such.

For what it's worth, a typical logistic regression model that I've encountered would use Approach 3, and convert each of your friends strings into a binary feature. If you're opposed to having 100k features, you could treat these features like a bag-of-words model and discard the stopwords (very common features).
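As a sketch of that bag-of-words variant, one could use CountVectorizer with a pass-through analyzer; the pruning thresholds below are placeholders, not recommendations:

from sklearn.feature_extraction.text import CountVectorizer

friends = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],
    ['Netflix', '9gag', 'World of Warcraft'],
]

# Treat each friends list as a "document" and each friend as a "token";
# binary=True gives 0/1 features, and max_df/min_df prune very common and
# very rare friends (the thresholds here are placeholders).
vec = CountVectorizer(analyzer=lambda friends_list: friends_list,
                      binary=True, max_df=0.9, min_df=1)
X_friends = vec.fit_transform(friends)          # sparse 0/1 matrix
print(vec.get_feature_names_out())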

I'll also throw a hashing variant into the mix:

Bloom Filter

You could store the strings in question in a bloom filter for each training example, and use the bits of the bloom filter as a feature in your logistic regression model. This is basically a hashing solution like you've already mentioned, but it takes care of some of the indexing/sorting issues, and provides a more principled tradeoff between sparsity and feature uniqueness.
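A rough sketch of that idea with a hand-rolled Bloom filter; the filter size and number of hash functions are arbitrary choices here:

import hashlib

FILTER_BITS = 256   # length of the 0/1 feature vector per example (arbitrary choice)
NUM_HASHES = 3      # hash functions per string (arbitrary choice)

def bloom_features(friends):
    # Set NUM_HASHES bits per friend; the resulting bits become model features.
    bits = [0] * FILTER_BITS
    for name in friends or []:
        for seed in range(NUM_HASHES):
            digest = hashlib.md5(f"{seed}:{name}".encode()).hexdigest()
            bits[int(digest, 16) % FILTER_BITS] = 1
    return bits

row = bloom_features(['World of Warcraft', 'Netflix', '9gag'])
print(sum(row), 'bits set out of', FILTER_BITS)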

lmjohns3

First, there is no definitive answer to this problem. You presented five alternatives, and all five are valid; it all depends on the dataset you are using.

Considering this, I will list the options that I find most advantageous. For me, option 5 is the best, but since you want to use traditional machine learning techniques, I will discard it. So I would go for option 4, provided you have the hardware to deal with it; if you do, I would go with that option. If not, I would try approach 2. As you pointed out, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array but not in row 4, yet that won't be a problem if you have enough data for training (again, it all depends on the data available). Even if I ran into some problems with this approach, I would apply a data augmentation technique before discarding it.

Option 1 seems the worst to me: it has a high risk of overfitting and will certainly use a lot of computational resources.

Hope this helps!


Approach 1 (Hash and Stack) and Approach 2 (Hash, Order, and Stack) resolve their limitations if the result of the hashing function is treated as the index of a sparse vector whose values are 1, rather than as the value stored at each position of the vector.

Then, whenever "World of Warcraft" is in the friends array, the feature vector will have a value of 1 at position 9001, regardless of where "World of Warcraft" appears in the friends array (the limitation of Approach 1) and regardless of which other elements are in the friends array (the limitation of Approach 2). If "World of Warcraft" is not in the friends array, then the value of the feature vector at position 9001 will most likely be 0 (look up hashing-trick collisions to learn more).
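This is essentially the hashing trick; as a sketch, scikit-learn's FeatureHasher does it directly (n_features plays the role of max_hash_size, and collisions become more likely as it shrinks):

from sklearn.feature_extraction import FeatureHasher

friends = [
    ['World of Warcraft', 'Netflix', '9gag'],
    [],                                         # NULL -> empty list
    ['Netflix', '9gag', 'World of Warcraft'],
    ['Netflix', '9gag'],
]

# Each friend name is hashed to a column index in a sparse vector, so neither
# the ordering nor the length of the list affects which column a name lands in.
hasher = FeatureHasher(n_features=2**18, input_type='string', alternate_sign=False)
X_friends = hasher.transform(friends)           # sparse matrix of shape (4, 2**18)
print(X_friends.shape, X_friends.nnz)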


Using a word2vec representation (as feature values) and then doing supervised classification can also be a good idea.
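For example, a sketch using gensim's pre-trained vectors; the model name is just one possible choice, and averaging token vectors is only one of several pooling schemes:

import numpy as np
import gensim.downloader as api

kv = api.load('glove-wiki-gigaword-50')   # any pre-trained word vectors would do

def friends_vector(friends):
    # Average the word vectors of all tokens in all friend names;
    # tokens missing from the vocabulary are simply skipped.
    tokens = [t for name in (friends or []) for t in name.lower().split()]
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

x = friends_vector(['World of Warcraft', 'Netflix', '9gag'])
print(x.shape)   # a fixed-length block usable by any traditional classifier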