How can I apply a classification algorithm to text data that is in the form of numerical tokens?

Question

I am working on a classification problem. The data consists of reviews of a particular product category from an e-commerce platform. Below is a description of each attribute:

id: Unique identifier for each tuple.
category: The reviews have been categorized into two categories representing positive and negative reviews. 0 represents positive reviews and 1 represents negative reviews.
text: Tokenized text content of the review.

The sample dataset is attached in the picture.

[Image: the training data format, consisting of the columns described above]

I am thinking of trying TF-IDF; however, given the format of the text, I don't know how to apply it.

I want to predict the category based on the text column provided.

PV8 · Answer 1 · 2019-09-20T11:42:39.750


You can use the text column as several features. I would recommend splitting that column (see "How do I split a string into several columns in a dataframe with pandas Python?"):

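# First load the dataframe (I assume it is in Excel format)
import pandas as pd
df = pd.read_excel('YourPath', header=0)  # header=0: the first row holds the column names
# Split the text column into one column per token
# (assumes whitespace-separated tokens; pass the real separator otherwise)
tokens = df['text'].str.split(expand=True)
df = pd.concat([df[['id', 'category']], tokens], axis=1)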

Then you can convert it to a (0,1) dataframe:

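# One-hot encode every token, then collapse back to one row per review.
# groupby(level=...).max() replaces .max(level=0), which newer pandas no longer accepts.
df1 = (pd.get_dummies(df.set_index(['id', 'category']).stack(), dtype=int)
         .groupby(level=['id', 'category']).max()
         .rename(columns=int)
         .reset_index())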

This will lead to something like:

id  category  5002  7400  ...
 1  A            1     0  ...
 2  B            0     1  ...

Here the columns are the token values from your dataframe, and a cell is set to 1 only if that token occurs in that row's review.
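Since the question mentions TF-IDF, here is a minimal sketch of how it could be applied directly to the token strings, without building the wide indicator table first. It is not part of the original answer. It assumes scikit-learn is available and that the text column holds whitespace-separated numeric tokens; the toy dataframe and the choice of LogisticRegression are purely illustrative.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data in the shape described in the question.
# Assumption: 'text' holds whitespace-separated numeric tokens.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "category": [0, 1, 0, 1],
    "text": ["5002 7400 12", "31 7400 88 88", "5002 12 640", "31 88 905"],
})

# token_pattern=r"\S+" treats every whitespace-delimited token as a term;
# the default pattern would drop single-character tokens such as "7".
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+", lowercase=False),
    LogisticRegression(),
)
model.fit(df["text"], df["category"])

# The vectorizer stores a sparse matrix, so reviews of different
# lengths do not produce columns padded with empty values.
vectorizer = model.named_steps["tfidfvectorizer"]
n_features = len(vectorizer.vocabulary_)

# Predict the category of two unseen (hypothetical) token strings.
predictions = model.predict(["5002 12", "31 88"])
```

Because the vectorizer builds its vocabulary from the tokens themselves, the numeric tokens are handled like ordinary words, and any classifier that accepts sparse input could be swapped in for the one used here.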

edited Sep 20 '19 at 11:42

answered Sep 20 '19 at 09:47


Yes, but if I split them, the number of tokens per review is not constant, so I will get many columns that are empty for most rows. – Dhrub Satyam Jha Sep 20 '19 at 10:21
You can check this out for the conversion: https://stackoverflow.com/questions/58027455/transform-cell-values-as-column-headers-and-fill-it-with-1-if-matching-in-python – PV8 Sep 20 '19 at 11:45
@DhrubSatyamJha did you get any solution? – Taylor Sep 23 '19 at 08:31
