I have a CSV file with 10 columns. My project is to classify the reviews in the file as good or bad using NLP. When I tokenise the column that stores the reviews ('Review Text') using re.sub, I get the error 'expected string or bytes-like object'.
I have attached my CSV file and the code that I tried in a Jupyter notebook.
This is my data file.
This is my code so far; the error is raised on the 're.sub' line:
import numpy as np
import pandas as pd
import nltk
import matplotlib
dataset = pd.read_csv("C:/Users/a/Downloads/data.tsv", delimiter = "\t", quoting = 1)
dataset.head()
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review Text'][i])
    review = review.lower()
    review = review.split()
    ps = PorterStemmer()
    review = [ps.stem(word) for word in review
              if word not in set(stopwords.words('english'))]
    review = ' '.join(review)
    corpus.append(review)
How do I correct this error? The next steps I want to do are vectorisation, training and classification.
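One likely cause, assuming the 'Review Text' column has empty cells: pandas reads a missing value as NaN, which is a float, and re.sub raises this TypeError whenever it is given something that is not a string. The sketch below is only an illustration of guarding against that. The helper name clean_review and the sample list are made up for the example, and the stemming and stopword steps are left out to keep it self-contained.

```python
import re

def clean_review(text):
    """Keep letters only and lower-case the text; non-string input becomes ''."""
    # An empty CSV cell arrives as float('nan'), which re.sub cannot process,
    # so anything that is not a str is treated as an empty review here.
    if not isinstance(text, str):
        return ''
    text = re.sub('[^a-zA-Z]', ' ', text)
    return ' '.join(text.lower().split())

# Hypothetical values standing in for dataset['Review Text']
sample = ["Great dress, fits well!", float('nan'), None, "Too small :("]
cleaned = [clean_review(r) for r in sample]
print(cleaned)
```

With the real data, the same effect can probably be had by calling dataset['Review Text'].fillna('') (or dropping those rows with dataset.dropna(subset=['Review Text'])) before the loop. Looping over the column itself rather than range(0, 1000) also avoids assuming the file has exactly 1000 rows.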