I'm working on a movie dataset that has genre as a feature. A movie can belong to several genres at once, so each row holds a list of genre labels.
The data looks like this:
   movieId  genres
0  1        [Adventure, Animation, Children, Comedy, Fantasy]
1  2        [Adventure, Children, Fantasy]
2  3        [Comedy, Romance]
3  4        [Comedy, Drama, Romance]
4  5        [Comedy]
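For anyone who wants to reproduce this, here is a minimal sketch of the same five rows as plain Python. The names `data` and `all_genres` are just placeholders I picked for illustration, and I'm assuming the genres are stored as real Python lists of strings rather than as a single delimited string.

```python
# The same five rows as in the table above, as a plain dict of columns.
# `data` and `all_genres` are illustrative names, not part of any library.
data = {
    "movieId": [1, 2, 3, 4, 5],
    "genres": [
        ["Adventure", "Animation", "Children", "Comedy", "Fantasy"],
        ["Adventure", "Children", "Fantasy"],
        ["Comedy", "Romance"],
        ["Comedy", "Drama", "Romance"],
        ["Comedy"],
    ],
}

# Every distinct genre label that appears in any row, in sorted order.
all_genres = sorted({genre for row in data["genres"] for genre in row})
print(len(all_genres), all_genres)
```

Passing `data` to `pandas.DataFrame` should give a frame like the one shown above, with `genres` as an object column of lists.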
I want to vectorize this feature. I have tried LabelEncoder and OneHotEncoder, but neither seems able to handle a column of lists directly.
I could vectorize this one manually, but I have other, similar features with far too many categories to encode by hand. For those I'd prefer some way to use the FeatureHasher class directly.
Is there some way to get these encoder classes to work on such a feature? Or is there a better way to represent such a feature that will make encoding easier? I'd gladly welcome any suggestions.