0

I have a CSV file with 10 columns. My project is to classify the reviews in the file as good or bad using NLP. When I tokenise the column that stores the reviews (the 'Review Text' column) using re.sub, I get the error 'expected string or bytes-like object'.

I have attached my CSV file and the code that I have tried in Jupyter Notebook.

This is my data file.

This is my code so far; the error is raised on the 're.sub' line:

import numpy as np
import pandas as pd
import nltk
import matplotlib

dataset = pd.read_csv("C:/Users/a/Downloads/data.tsv", delimiter = "\t", quoting = 1)
dataset.head()

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
  review = re.sub('[^a-zA-Z]', ' ', dataset['Review Text'][i])
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  review = [ps.stem(word) for word in review if word not in set(stopwords.words('english'))]
  review = ' '.join(review)
  corpus.append(review)

How do I fix this error? The next steps I want to do are vectorisation, training and classification.

halfer
  • The issue is that some of the data in your data file are converted by Pandas to actual ints and floats. Those then are not strings and that is throwing your error. – dawg Dec 07 '19 at 16:44

2 Answers

2

The source of your problem is cells with empty content, which read_csv reads as NaN by default; NaN is a "special case" of float.

re.sub, on the other hand, needs string data (not a float).

One possible solution is to replace all NaN values with an empty string:

dataset['Review Text'] = dataset['Review Text'].replace(np.nan, '')

and then call re.sub.
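
To see why a NaN cell trips re.sub and what the replacement achieves, here is a minimal sketch that uses a plain Python list as a stand-in for the 'Review Text' column. The sample reviews and the helper name blank_out_missing are made up for illustration; they are not from your file.

```python
import math
import re

# Stand-in for dataset['Review Text']; float('nan') plays the role of an empty cell.
reviews = ["Love this dress!", float("nan"), "Too small, returned it."]

def blank_out_missing(values):
    """Hypothetical helper: replace NaN entries with '' so every item is a string."""
    return ["" if isinstance(v, float) and math.isnan(v) else v for v in values]

# re.sub rejects the NaN entry because it is a float, not a str.
try:
    re.sub("[^a-zA-Z]", " ", reviews[1])
except TypeError as exc:
    error_message = str(exc)

# After the replacement every entry is a string, so the same call succeeds.
cleaned = [re.sub("[^a-zA-Z]", " ", r) for r in blank_out_missing(reviews)]
```

The empty strings simply produce empty reviews in the corpus, so you may prefer to drop those rows altogether before the later steps.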

Valdi_Bo
  • This helped, thank you so much. Could you please help me with another step, where I am splitting the data into training and testing sets? I am using the code from sklearn.model_selection import train_test_split; X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42), but it gives an error saying "Found input variables with inconsistent numbers of samples: [1000, 23486]". @Valdi_Bo –  Dec 07 '19 at 17:29
0

Use

review = re.sub('[^a-zA-Z]', ' ', str(dataset['Review Text'][i]))

instead of

review = re.sub('[^a-zA-Z]', ' ', dataset['Review Text'][i])
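
One caveat with str(), shown as a small sketch: a missing cell does not become empty text, it becomes the literal word 'nan', which passes through the regex untouched and ends up in the corpus as a token. The value below is only a stand-in for an empty cell.

```python
import re

missing = float("nan")  # stand-in for the NaN that an empty cell is read as

as_text = str(missing)                      # the three-letter string 'nan'
review = re.sub("[^a-zA-Z]", " ", as_text)  # only letters, so nothing is replaced
tokens = review.lower().split()             # the review now contains the word 'nan'
```

If that token is unwanted, filter it out or replace the NaN values with an empty string before the loop.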
ArunJose
  • This helped, thank you so much. Could you please help me with another step, where I am splitting the data into training and testing sets? I am using the code from sklearn.model_selection import train_test_split; X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42), but it gives an error saying "Found input variables with inconsistent numbers of samples: [1000, 23486]". –  Dec 07 '19 at 17:32
  • Make sure your X and y have the same number of rows. – ArunJose Dec 07 '19 at 18:35
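
The two numbers in that message hint at the cause, though this is a guess since the code that builds X and y is not shown: the loop in the question only processes 1000 reviews, while the file appears to have 23486 rows, so y was probably taken from the whole column. A sketch with plain lists of keeping both at the same length (the data and the names all_reviews, all_labels and n_reviews are made up for illustration):

```python
# Hypothetical stand-ins: 5 rows in the file, but only the first 3 reviews are processed.
all_reviews = ["good fit", "bad fabric", "love it", "too tight", "nice colour"]
all_labels = [1, 0, 1, 0, 1]

n_reviews = 3
corpus = [text.lower() for text in all_reviews[:n_reviews]]

# Take the labels for exactly the rows that went into the corpus.
labels = all_labels[:n_reviews]

# Both sequences must have the same length before they are split.
assert len(corpus) == len(labels)
```

Alternatively, loop over range(len(dataset)) instead of range(0, 1000) so that every row is processed and X matches the full label column.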