I am working on a classification problem. The data consists of reviews of a particular product category from an e-commerce platform. Each attribute is described below:

  • id: Unique identifier for each tuple.
  • category: The reviews have been categorized into two categories representing positive and negative reviews. 0 represents positive reviews and 1 represents negative reviews.
  • text: Tokenized text content of the review.

The sample dataset is attached in the picture.

[Image: the training data format, consisting of the columns described above]

I am thinking of trying TF-IDF; however, given the text format, I don't know how to apply it.

I expect to predict the category based on the text column provided.

Toby Speight

1 Answer

You can use the text column as several features. I would recommend splitting that column (see "How do I split a string into several columns in a dataframe with pandas Python?"):

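# First load the dataframe (assuming an Excel file)
import pandas as pd
df = pd.read_excel('YourPath', header=0)  # header expects a row index, not True
# Split the text into one column per token (assumes whitespace-separated tokens;
# pass your own separator to split() if the data uses a different one)
tokens = df['text'].str.split(expand=True)
df = pd.concat([df[['id', 'category']], tokens], axis=1)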
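As a small illustration of what the split step produces, here is a sketch on made-up data. The two rows and their token values are hypothetical, and the tokens are assumed to be separated by whitespace. Rows with fewer tokens are padded with missing values, so the result always has as many columns as the longest row:

```python
import pandas as pd

# Hypothetical toy data in the question's format (not the real dataset).
toy = pd.DataFrame({
    "id": [1, 2],
    "category": [0, 1],
    "text": ["5002 7400 12", "7400 9"],
})

# Assumption: tokens are whitespace-separated.
# expand=True returns one column per token position.
tokens = toy["text"].str.split(expand=True)
print(tokens)
```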

Then you can convert it to a (0,1) dataframe:

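# One-hot encode every token, then collapse to one row per (id, category)
df1 = (pd.get_dummies(df.set_index(['id', 'category']).stack())
         .groupby(level=['id', 'category']).max()  # max(level=...) was removed in pandas 2.0
         .rename(columns=int)  # assumes every token is an integer-like string
         .reset_index())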

This will lead to something like:

id  category  5002  7400  ...
 1     A         1     0  ...
 2     B         0     1  ...

Here each column is a token value from your dataframe, and a cell is set to 1 only if that token occurs in that row's text.
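Since the question asks about TF-IDF specifically, the same idea can be sketched with weights instead of 0/1 flags. The following is a minimal hand-rolled version using only the standard library, with the plain formula tf × log(N / df). The rows are hypothetical, and the tokens are assumed to be whitespace-separated. Library implementations such as scikit-learn's TfidfVectorizer use a smoothed, normalized variant, so their numbers will differ from these:

```python
import math
from collections import Counter

# Hypothetical toy rows in the question's format: (id, category, text).
rows = [
    (1, 0, "5002 7400 12"),
    (2, 1, "7400 9"),
    (3, 0, "5002 12 12"),
]

# Assumption: tokens are separated by whitespace.
docs = [text.split() for _, _, text in rows]
vocab = sorted({tok for doc in docs for tok in doc})

# Document frequency: number of reviews that contain each token.
doc_freq = Counter(tok for doc in docs for tok in set(doc))
n_docs = len(docs)


def tfidf(doc):
    """Return one fixed-length TF-IDF vector, ordered like `vocab`."""
    counts = Counter(doc)
    return [
        (counts[tok] / len(doc)) * math.log(n_docs / doc_freq[tok])
        for tok in vocab
    ]


X = [tfidf(doc) for doc in docs]            # one row per review, one column per token
y = [category for _, category, _ in rows]   # labels to predict
```

Every row of `X` has the same length regardless of how many tokens the review contains, so `X` and `y` can be passed to any classifier that accepts numeric feature vectors.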

PV8
  • Yes, but if I split them, the number of tokens per row is not constant, so I will get many columns that are empty in most rows. – Dhrub Satyam Jha Sep 20 '19 at 10:21
  • You can check this out for the conversion: https://stackoverflow.com/questions/58027455/transform-cell-values-as-column-headers-and-fill-it-with-1-if-matching-in-python – PV8 Sep 20 '19 at 11:45
  • @DhrubSatyamJha did you get any solution? – Taylor Sep 23 '19 at 08:31