I'm working with a dataset that has a Tags column extracted from a Stack Overflow dump. I need to encode these tags so I can predict tags from a question's title and body.

I'm stuck on this encoding step and can't get the result I need.

Here's a preview of my column:

Tags
['python', 'authentication', 'login', 'flask', 'python-2.x']
['c++', 'vector', 'c++11', 'move', 'deque']
...

And here is what I'm doing so far:

    y_classes = pd.get_dummies(df.Tags)
    y_classes

Output:

['.net', 'asp.net-mvc', 'visual-studio', 'asp.net-mvc-4', 'intellisense'] ['.net', 'asp.net-mvc-3', 'linq', 'entity-framework', 'entity-framework-5']
0 0 0
0 0 0
0 0 0

As you can see, I get one column for each unique list of tags, but I need one column for each individual tag. I tried several solutions found on Stack Overflow, but none of them worked.

EDIT: I also tried MultiLabelBinarizer from sklearn.preprocessing, and I got one column for each unique character in the Tags column.

How can I make this work?

  • Please include any relevant information [as text directly into your question](https://stackoverflow.com/editing-help), do not link or embed external images of source code or data. Images make it difficult to efficiently assist you as they cannot be copied and offer poor usability to others as they cannot be searched. See: [Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/q/285551/15497888). Also, please see [How to make a minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). – AlexK Aug 22 '22 at 23:59
  • continued: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391) – AlexK Aug 23 '22 at 00:06
  • @AlexK Thanks for your feedback, I updated the post with tables and code – Thibaut Penhard Aug 24 '22 at 08:18
  • This is the expected behavior for `pandas.get_dummies`. sklearn's `MultiLabelBinarizer` is meant for this use-case; if you provide the details of what you tried and what resulted we can probably diagnose that. – Ben Reiniger Aug 24 '22 at 13:09
  • Looks like you could have made MultiLabelBinarizer work: the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html) say that you end up with the column for each character if you pass in a `list` but can solve it by using `set` or nested `list` instead. – OllieStanley Sep 01 '22 at 08:23

1 Answer

OK, I figured out how to fix this myself, so here is my solution:

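    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer

    # Copy the Tags column into its own DataFrame
    tags_array = df['Tags'].to_numpy()
    df2 = pd.DataFrame(tags_array, columns=['Tags'])

    # Tokenize each row's tag string and count occurrences per token
    coun_vect = CountVectorizer()
    count_matrix = coun_vect.fit_transform(df2["Tags"])
    count_array = count_matrix.toarray()

    # One column per token; get_feature_names() was removed in scikit-learn 1.2
    df2 = pd.DataFrame(data=count_array,
                       columns=coun_vect.get_feature_names_out())
    print(df2)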

Output:

ajax algorithm amazon android angular ...
0 0 0 1 0 ...
1 1 0 0 0 ...
0 0 1 0 1 ...
... ... ... ... ... ...

Edit:

As @OllieStanley said in a comment, this could also have worked with MultiLabelBinarizer. The problem was that the input was being treated as a flat list, which produces one column per character; passing a set or a nested list (one collection of tags per row) instead solves it.
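
For reference, a minimal sketch of the MultiLabelBinarizer route. It assumes the Tags cells are strings that look like lists (which would explain the one-column-per-character result), so each cell is parsed with `ast.literal_eval` first; if the cells already hold real lists, that step can be skipped. The sample DataFrame is invented for illustration.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample data: each cell is the string form of a list of tags
df = pd.DataFrame({
    "Tags": [
        "['python', 'flask', 'login']",
        "['c++', 'vector', 'c++11']",
        "['python', 'c++']",
    ]
})

# Turn each string into a real Python list (assumption: cells are strings)
tag_lists = df["Tags"].apply(ast.literal_eval)

# Fit on the nested lists: one row per question, one column per distinct tag
mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(tag_lists)

y_classes = pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)
print(y_classes)
```

Unlike the CountVectorizer version, this keeps each tag as a single label, and `mlb.inverse_transform` can map predictions back to tag names.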