I'm trying to apply a text classification algorithm, but I get an error when loading my data:

import sklearn
import numpy as np
from sklearn import svm
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import pandas as pd
import pandas

dataset = pd.read_csv('train.csv', encoding = 'utf-8')
data = dataset['data']
labels = dataset['label']

X_train, X_test, y_train, y_test = train_test_split (data.data, labels.target, test_size = 0.2, random_state = 0)


vecteur = CountVectorizer()
X_train_counts = vecteur.fit_transform(X_train)

tfidf = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

clf = MultinomialNB().fit(X_train_tfidf, y_train)

#SVM
clf = svm.SVC(kernel = 'linear', C = 10).fit(X_train, y_train)
print(clf.score(X_test, y_test))

I have the following error:

Traceback (most recent call last):
  File "bayes_classif.py", line 22, in <module>
    dataset = pd.read_csv('train.csv', encoding = 'utf-8')
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.6/dist-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 2 fields in line 72, saw 3

My data

data, label
bought noon <product> provence <product> shop givors moment <price> bad surprise <time> made account price <price> catalog expect part minimum refund difference wait read brief delay, refund

parcel ordered friend n still not arrive possible destination send back pay pretty unhappy act gift birth <date> status parcel n not moved weird think lost stolen share quickly solutions can send gift both time good <time>, call

ordered <product> coat recovered calais city europe shops n not used assemble parties up <time> thing done <time> bad surprise parties not aligned correctly can see photo can exchange made refund man, annulation

note <time> important traces rust articles come to buy acting carrying elements going outside extremely disappointed wish to return together immediately full refund indicate procedure sabrina beillevaire <phone_numbers>, refund

note <time> important traces rust articles come to buy acts acting bearing elements going outside extremely disappointed wish to return together immediately full refund indicate procedure <phone_numbers>, annulation

request refund box jewelry arrived completely broken box n not protected free delivery directly packaging plastic item fragile cardboard box <product> interior shot cover cardboard torn corners <product> completely broken, call
marin
  • How does `train.csv` look? – Sruthi Aug 17 '18 at 11:38
  • Possible duplicate of [Python Pandas Error tokenizing data](https://stackoverflow.com/questions/18039057/python-pandas-error-tokenizing-data) – tif Aug 17 '18 at 11:39
  • @SruthiV It is a two-column file: one column is the text and the other is the label. Example: data,label – marin Aug 17 '18 at 11:52
  • There is some issue with the structure of the csv file. Have you tried `pd.read_csv('train.csv', error_bad_lines=False)`? – Hari Krishnan Aug 17 '18 at 12:07

1 Answer

Can you try to reproduce the error with cleaner code? Yours contains a few mistakes and unnecessary lines. We also need a sample of your data that reproduces the error; otherwise we won't be able to help.

Here is what I assume you are trying to do. Please try running it with your data and tell us whether you still get the same error:

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

dataset = pd.DataFrame({'data':['A first sentence','And a second sentence','Another one','Yet another line','And a last one'],
                    'label':[1,0,0,1,1]})
data = dataset['data']
labels = dataset['label']

X_train, X_test, y_train, y_test = train_test_split (data, labels, test_size = 0.2, random_state = 0)


vecteur = CountVectorizer()
tfidf = TfidfTransformer()

X_train_counts = vecteur.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
X_test_tfidf = tfidf.transform(vecteur.transform(X_test))

clf = svm.SVC(kernel = 'linear', C = 10).fit(X_train_tfidf, y_train)
print(clf.score(X_test_tfidf, y_test))

EDIT:

Judging by your data, the error is probably caused by a comma inside the text of one of the rows in your csv file, which gives that row a third field and makes the pandas parser fail. You can tell pandas to skip such rows with the error_bad_lines argument of read_csv. Here is a short example:

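from io import StringIO

temp = u"""data, label
A first working line, refund
a second ok line, call
last line with an inside comma: , character which makes it bug, call"""
# error_bad_lines=False skips rows that have too many fields.
# In pandas >= 1.3 this is spelled on_bad_lines='skip' (error_bad_lines was removed in 2.0).
df = pd.read_csv(StringIO(temp), error_bad_lines=False)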
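Skipping bad rows throws away training examples. If the label is always the last field, as it appears to be in the sample above, an alternative is to split each line on its last comma only, so that commas inside the text stay part of the text. The following is a minimal sketch under that assumption; `split_on_last_comma` and the `raw` sample are hypothetical names used for illustration, not part of pandas:

```python
# Hypothetical sample mirroring the layout above: free text, then a label.
raw = """data, label
A first working line, refund
a second ok line, call
last line with an inside comma: , character which makes it bug, call"""


def split_on_last_comma(text):
    """Return (data, label) pairs, treating only the final comma as the separator."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        if not line.strip():
            continue  # ignore blank lines
        # rsplit with maxsplit=1 cuts at the last comma only;
        # a line with no comma at all would raise ValueError here.
        data, label = line.rsplit(",", 1)
        rows.append((data.strip(), label.strip()))
    return rows


rows = split_on_last_comma(raw)
for data, label in rows:
    print(repr(data), "->", label)
```

The resulting list of pairs could then be handed to `pd.DataFrame(rows, columns=['data', 'label'])`. This only works if labels themselves never contain a comma.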
ysearka
  • I put some of my data in. I really do not know how to do {'data':['A first sentence','And a second sentence','Another one','Yet another line','And a last one'] because I have a lot of data... – marin Aug 17 '18 at 13:32
  • No problem, you might want to take a look at your text data, there might be some commas in your column `data` causing pandas parser to fail. (especially in the 72nd line of your file according to the error you gave). – ysearka Aug 17 '18 at 13:35
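To check the file for such commas without reading it by eye, one option is to count the fields on every line with the standard `csv` module. This is a rough sketch, not a pandas feature: `find_bad_lines` is a hypothetical helper, and the in-memory `sample` stands in for the real `train.csv`.

```python
import csv
from io import StringIO


def find_bad_lines(handle, expected_fields=2):
    """Return (line_number, field_count) for rows whose field count differs."""
    bad = []
    reader = csv.reader(handle)
    for row in reader:
        # csv.reader yields an empty list for blank lines; skip those.
        if row and len(row) != expected_fields:
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical stand-in for train.csv: the third line has an extra comma.
sample = StringIO(
    "data, label\n"
    "A first working line, refund\n"
    "last line with an inside comma: , character, call\n"
)
bad = find_bad_lines(sample)
print(bad)
```

On the real file the same helper could be called as `find_bad_lines(open('train.csv', newline='', encoding='utf-8'))`, which should list line 72 if the parser error above is accurate.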