6

My data is slightly unbalanced, so I am trying to apply SMOTE before fitting a logistic regression model. When I do, I get the error: KeyError: 'Only the Series name can be used for the key in Series dtype mappings.' Could someone help me figure out why? Here is the code:

X = dummies.loc[:, dummies.columns != 'Count']
y = dummies.loc[:, dummies.columns == 'Count']
#from imblearn.over_sampling import SMOTE
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
os_data_X,os_data_y=os.fit_sample(X_train, y_train) # here is where it errors
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['Count'])

Thank you!

devdon

4 Answers

17

I just encountered this problem myself. As it turned out, I had a duplicate column in my dataset. Perhaps double check that this is not the case for your dataset.

Maxime
2

This error is usually caused by duplicate column names in your data. To check for duplicates, inspect the column labels with:

df.head()

or df.columns (df.columns.duplicated() flags any repeated names).

To fix it, drop the duplicated column(s):

df.drop('column_name', axis=1, inplace=True)

Note that dropping by label removes every column carrying that name.
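As a rough sketch of the check and fix described in this answer, the snippet below finds repeated column names and keeps only the first occurrence of each. The frame `df` and its column names are made up for illustration; only the pandas calls (`Index.duplicated`, boolean `.loc` selection) are real.

```python
import pandas as pd

# Hypothetical frame with a repeated column name "a",
# e.g. the result of concatenating dummy variables twice.
df = pd.DataFrame(
    [[1, 0, 1, 5], [0, 1, 0, 7]],
    columns=["a", "b", "a", "Count"],
)

# True for the second and later occurrences of each column name.
dup_mask = df.columns.duplicated()
duplicate_names = df.columns[dup_mask].tolist()
print(duplicate_names)

# Keep only the first occurrence of every column name.
deduped = df.loc[:, ~dup_mask]
print(deduped.columns.tolist())
```

Selecting with the inverted mask keeps one copy of each column, whereas dropping by label would remove all columns that share the name.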

Beta Ways
1

I actually just fixed this problem! I converted both inputs to matrices:

os_data_X, os_data_y = os.fit_sample(X_train.as_matrix(), y_train.as_matrix())

devdon
  • 1
    as_matrix is deprecated for more recent versions of pandas. This thread https://stackoverflow.com/questions/13187778/convert-pandas-dataframe-to-numpy-array recommends to_numpy or values. – Evelin Amorim Feb 01 '21 at 17:12
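Following the comment above, here is a minimal sketch of the same conversion using `to_numpy` instead of `as_matrix`. The small `X_train` and `y_train` frames are hypothetical stand-ins for the ones in the question. The resulting arrays could then be passed to the sampler (in current imbalanced-learn releases the method is `fit_resample`).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for X_train / y_train from the question.
X_train = pd.DataFrame({"f1": [0.1, 0.4, 0.9], "f2": [1, 0, 1]})
y_train = pd.DataFrame({"Count": [0, 1, 0]})

# Plain NumPy arrays carry no column labels.
X_arr = X_train.to_numpy()
# Flatten the (n, 1) target frame to a 1-D array of shape (n,).
y_arr = y_train.to_numpy().ravel()

print(X_arr.shape, y_arr.shape)
```

Because the arrays have no column names, a duplicate-name problem can no longer surface this way, but any duplicated column is still present in the data, so checking the column labels first is probably the cleaner fix.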
1

100% correct solution.

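Try converting your X features to an array first, and then feed them to SMOTE:

import numpy as np
from imblearn.over_sampling import SMOTE

sm = SMOTE()
X = np.array(X)
# fit_resample is the current name of fit_sample in imbalanced-learn
X, y = sm.fit_resample(X, np.array(y).ravel())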