
I have applied an SVM to my dataset. The dataset is multi-label, meaning each observation can have more than one label.

During KFold cross-validation it raises a "not in index" error.

The error reports that indices 601 to 6007 are not in the index (my data samples are numbered 1...6008).

This is my code:

df = pd.read_csv("finalupdatedothers.csv")
categories = ['ADR','WD','EF','INF','SSI','DI','others']
X= df[['sentences']]
y = df[['ADR','WD','EF','INF','SSI','DI','others']]
kf = KFold(n_splits=10)
kf.get_n_splits(X)
for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

SVC_pipeline = Pipeline([
                ('tfidf', TfidfVectorizer(stop_words=stop_words)),
                ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=1)),
            ])

for category in categories:
    print('... Processing {} '.format(category))
    # train the model using X_dtm & y
    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
    print('SVM Linear Test accuracy is {} '.format(accuracy_score(X_test[category], prediction)))
    print('SVM Linear f1 measurement is {} '.format(f1_score(X_test[category], prediction, average='weighted')))
    print([{X_test[i]: categories[prediction[i]]} for i in range(len(list(prediction)))])

Actually, I do not know how to apply KFold cross-validation so that I get the F1 score and accuracy of each label separately. I have looked at this and this, but they did not help me apply it successfully to my case.

To make this reproducible, here is a small sample of the data frame. The last seven columns are my labels (ADR, WD, ...).

,sentences,ADR,WD,EF,INF,SSI,DI,others
0,"extreme weight gain, short-term memory loss, hair loss.",1,0,0,0,0,0,0
1,I am detoxing from Lexapro now.,0,0,0,0,0,0,1
2,I slowly cut my dosage over several months and took vitamin supplements to help.,0,0,0,0,0,0,1
3,I am now 10 days completely off and OMG is it rough.,0,0,0,0,0,0,1
4,"I have flu-like symptoms, dizziness, major mood swings, lots of anxiety, tiredness.",0,1,0,0,0,0,0
5,I have no idea when this will end.,0,0,0,0,0,0,1

Update

When I did what Vivek Kumar suggested, it raised the error

ValueError: Found input variables with inconsistent numbers of samples: [1, 5408]

in the classifier part. Do you have any idea how to resolve it?

There are a couple of links for this error on Stack Overflow which say I need to reshape the training data. I also tried that, but without success (link). Thanks :)

Vivek Kumar
sariii
  • Can you elaborate? You wrote that you get an error when using KFold. Is this in the code you attached? On what line? – ShaharA Aug 15 '18 at 05:50
  • @ShaharA Thanks for the comment. It raises the error when it tries to do the KFold, so on an early line in the code. I put the whole code here to show what I want to use the splits for later. Actually the code works perfectly when I use train_test_split, but with KFold it does not – sariii Aug 15 '18 at 05:53
  • Did you try kf.split(X) without the y inside? – ShaharA Aug 15 '18 at 06:08
  • @ShaharA Yes. Actually, it seems the error is not related to that argument – sariii Aug 15 '18 at 06:09
  • I also updated with a small sample of my data frame so it is now reproducible. – sariii Aug 15 '18 at 06:18

1 Answer


train_index and test_index are integer positions based on the number of rows. But pandas indexing doesn't work like that: plain [] on a DataFrame looks up column labels, not row positions. Newer versions of pandas are also stricter about how you slice or select data.

You need to use .iloc to access the data. More information is available here
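To illustrate the difference, here is a minimal sketch on a toy frame. The frame, its index values and the `positions` array are made up for the example; `positions` just stands in for what kf.split would return.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index so labels and positions differ.
X = pd.DataFrame({"sentences": ["a", "b", "c", "d"]}, index=[10, 11, 12, 13])
positions = np.array([0, 2])  # stand-in for a train_index from KFold.split

try:
    X[positions]  # [] treats the integers as column labels
    lookup_failed = False
except KeyError:
    lookup_failed = True  # no columns named 0 or 2 exist

selected = X.iloc[positions]  # positional selection: first and third rows
print(lookup_failed, selected["sentences"].tolist())
```

So X[train_index] fails with a KeyError, while X.iloc[train_index] picks the rows by position regardless of what the index labels are.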

This is what you need:

for train_index, test_index in kf.split(X,y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    ...
    ...

    # TfidfVectorizer doesn't work with a DataFrame,
    # because iterating a DataFrame gives the column names, not the actual data.
    # So select the column explicitly by name to get the sentences.

    SVC_pipeline.fit(X_train['sentences'], y_train[category])

    prediction = SVC_pipeline.predict(X_test['sentences'])
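
The remark in the comments of the snippet above can be checked directly. This is a small sketch with three of the sample sentences from the question; it also seems to explain where the "1" in "[1, 5408]" comes from, since a one-column DataFrame is seen as a single document.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

X = pd.DataFrame({"sentences": [
    "I am detoxing from Lexapro now.",
    "I have no idea when this will end.",
    "I am now 10 days completely off and OMG is it rough.",
]})

# Iterating a DataFrame yields its column labels, not its rows.
columns_seen = list(X)

# Passing the DataFrame: the vectorizer sees one "document" (the column name).
n_from_frame = TfidfVectorizer().fit_transform(X).shape[0]

# Passing the column (a Series): one document per row.
n_from_series = TfidfVectorizer().fit_transform(X["sentences"]).shape[0]

print(columns_seen, n_from_frame, n_from_series)
```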
Vivek Kumar
  • Thank you so much for the answer. Then how should I pass the training data to the classifier? With this I got the error ValueError: Found input variables with inconsistent numbers of samples: [1, 5408] at the line SVC_pipeline.fit(X_train, y_train[category]). Thanks for taking the time – sariii Aug 15 '18 at 07:53
  • Thanks again. Actually I have also tried it this way, but it raises an error: – sariii Aug 16 '18 at 05:58
  • File "pandas\_libs\index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc File "pandas\_libs\index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc File "pandas\_libs\hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item KeyError: 'ADR' – sariii Aug 16 '18 at 05:58
  • Even X_train and y_train are highlighted in yellow, which suggests there is something wrong with them, but I'm not able to figure it out. I also saw this post of yours here https://stackoverflow.com/questions/44429600/indexing-a-csv-running-into-inconsistent-number-of-samples-for-logistic-regressi but it did not help :| – sariii Aug 16 '18 at 06:01
  • @sariaGoudarzi On your sample data and code you provided in the question above, I am not getting this error. Please update the question with the current complete code (that you used after taking hints from this answer) and also the complete stack trace of error. – Vivek Kumar Aug 16 '18 at 06:09
  • That's too weird actually. I copy-pasted the whole code I have again, and the data frame is really the same – sariii Aug 16 '18 at 06:13
  • I also copy-pasted the first 5 rows of my data. The weird thing is that even before running the code, X_train['sentences'], y_train in the fit call is already highlighted in yellow, which suggests something is wrong. – sariii Aug 16 '18 at 06:15
  • @sariaGoudarzi You added a new line in there: `print([{X_test[i]: categories[prediction[i]]} for i in range(len(list(prediction)))])`. The error is from this. X_train and X_test are dataframes, and indexing doesn't work on them the way you want. – Vivek Kumar Aug 16 '18 at 06:15
  • Oh sorry, I did not realize that might be a problem, as the error points to print('SVM Linear Test accuracy is {} '.format(accuracy_score(X_test[category], prediction))). I have no idea what my options are if I cannot use indexing like test_x[i]. May I ask you to help with this part too? Sorry, I know it's not part of this question – sariii Aug 16 '18 at 06:20
  • @sariaGoudarzi Use iloc. `print([(X_test['sentences'].iloc[i], categories[prediction[i]]) for i in range(len(list(prediction)))])` – Vivek Kumar Aug 16 '18 at 06:23
  • It raises an error :| Shouldn't I change the accuracy_score part also? – sariii Aug 16 '18 at 06:28
  • @sariaGoudarzi Yes. Replace `X_test[category]` with `y_test[category]` in both lines. – Vivek Kumar Aug 16 '18 at 06:30
  • I just figured it out right now. This one was my mistake, so sorry for that. I greatly appreciate your help and patience. Best of luck :) – sariii Aug 16 '18 at 06:33
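
Putting the answer and the comments together, the per-label loop could look roughly like the sketch below. It is only an illustration under some assumptions: the toy sentences and labels are made up (they stand in for finalupdatedothers.csv), only two label columns are used, stop_words is left out, and the `scores` dictionary and the shuffled 3-fold split are choices made for this example rather than something from the question.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: 6 rows per label so every training fold sees both classes.
adr = ["severe headache and nausea", "hair loss and weight gain",
       "constant dizziness after the dose", "memory loss and tiredness",
       "nausea and blurred vision", "weight gain and insomnia"]
other = ["I have no idea when this will end", "my doctor changed the plan",
         "I took vitamin supplements to help", "we talked about the schedule",
         "the pharmacy was closed today", "I will call the clinic tomorrow"]
df = pd.DataFrame({
    "sentences": adr + other,
    "ADR": [1] * 6 + [0] * 6,
    "others": [0] * 6 + [1] * 6,
})

categories = ["ADR", "others"]
X = df[["sentences"]]
y = df[categories]

kf = KFold(n_splits=3, shuffle=True, random_state=0)
scores = {c: {"accuracy": [], "f1": []} for c in categories}

for train_index, test_index in kf.split(X):
    # Positional selection of the rows that belong to this fold.
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    for category in categories:
        # Fresh pipeline per label and per fold, mirroring the question's setup.
        pipeline = Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("clf", OneVsRestClassifier(LinearSVC(), n_jobs=1)),
        ])
        # Pass the text column (a Series), not the DataFrame.
        pipeline.fit(X_train["sentences"], y_train[category])
        prediction = pipeline.predict(X_test["sentences"])

        # Score against y_test, not X_test.
        scores[category]["accuracy"].append(
            accuracy_score(y_test[category], prediction))
        scores[category]["f1"].append(
            f1_score(y_test[category], prediction, average="weighted"))

for category, s in scores.items():
    print(category, "accuracy:", np.mean(s["accuracy"]), "f1:", np.mean(s["f1"]))
```

The key point is that fitting and scoring both happen inside the fold loop, so each label ends up with one accuracy and one F1 value per fold, which can then be averaged.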