I'm working on a dataset with a Tags column extracted from a Stack Overflow dataset. I need to encode these tags so that I can predict them from a question's title and body.
I'm stuck on this encoding and can't get the result I need.
Here's a preview of my column:
Tags |
---|
['python', 'authentication', 'login', 'flask', 'python-2.x'] |
['c++', 'vector', 'c++11', 'move', 'deque'] |
... |
And here is what I'm doing so far:
y_classes = pd.get_dummies(df.Tags)
y_classes
['.net', 'asp.net-mvc', 'visual-studio', 'asp.net-mvc-4', 'intellisense'] | ['.net', 'asp.net-mvc-3', 'linq', 'entity-framework', 'entity-framework-5'] | |
---|---|---|
0 | 0 | 0 |
0 | 0 | 0 |
0 | 0 | 0 |
As you can see, I get one column for each unique list of tags, whereas I need one column for each individual tag. I tried several solutions found on Stack Overflow, but none of them worked.
EDIT: I also tried MultiLabelBinarizer from sklearn.preprocessing, but I got one column for each unique character in the Tags column.
How can I make this work?
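One direction that may be worth checking: getting one column per character from MultiLabelBinarizer would be consistent with each Tags value being stored as a string that looks like a list (for example after a CSV round trip) rather than as an actual Python list. The following is a minimal sketch under that assumption. The two-row sample frame and the name `tag_lists` are illustrative only and do not come from the real dataset.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative data: each Tags value is a *string* that looks like a list.
# This is an assumption about how the real column is stored.
df = pd.DataFrame({
    "Tags": [
        "['python', 'authentication', 'login', 'flask', 'python-2.x']",
        "['c++', 'vector', 'c++11', 'move', 'deque']",
    ]
})

# Parse each string into a real Python list; values that are already
# lists are passed through unchanged.
tag_lists = df["Tags"].apply(
    lambda value: ast.literal_eval(value) if isinstance(value, str) else value
)

# Fit the binarizer on the lists: one indicator column per distinct tag.
mlb = MultiLabelBinarizer()
y_classes = pd.DataFrame(
    mlb.fit_transform(tag_lists),
    columns=mlb.classes_,
    index=df.index,
)

print(y_classes.columns.tolist())
```

If `type(df.Tags.iloc[0])` already reports `list`, the parsing step is a no-op and the cause of the per-character columns lies elsewhere.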