
I think I'm missing something in the code below.

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE


# Split into training and test sets

# Testing Count Vectorizer

X = df[['Spam']]
y = df['Value']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)


sm = pd.concat([X_resampled, y_resampled], axis=1)

as I'm getting the error

ValueError: could not convert string to float

raised at the line X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)

Example of data is

Spam                                             Value
Your microsoft account was compromised             1
Manchester United lost against PSG                 0
I like cooking                                     0

I'd consider transforming both the train and test sets to fix the issue causing the error, but I don't know how to apply the transformation to both. I've tried some examples from Google, but they haven't fixed the issue.

  • What's the rationale for putting them into a data frame again? The vectorized count is a sparse matrix and it can be really huge on memory if you convert to an array – StupidWolf Dec 13 '20 at 23:21
  • @StupidWolf, just for quality check and split into train set, test set and validation set. I would need also for creating feature vector - document term matrix – Math Dec 13 '20 at 23:24
  • so you need to pass the vectorized counts into smote, as the answer below suggest. I don't think it's wise to put them into a data frame after that. You don't need a dataframe for anything downstream – StupidWolf Dec 13 '20 at 23:38
  • I did something similar when I oversampled the dataset. So I created some functions and used some already built-in which take as parameter the train_set, test_set, and valid_set after splitting the original dataset into train and test. – Math Dec 13 '20 at 23:43
  • `class_1 = tr_set[tr_set.Label == 1] class_0 = tr_set[tr_set.Label == 0] oversample = resample(class_1, replace=True, n_samples=len(class_0), random_state=1) over_train = pd.concat([class_0, oversample])`. How could I do something similar after using SMOTE? Does what I did for oversampling, and would like to do for SMOTE, not make sense? I'm reading a lot of topics on this, papers, websites... no one seems to agree on how to use re-sampling. I'm learning, so I'm following what others suggest. – Math Dec 13 '20 at 23:44

3 Answers


Convert the text data to numeric before applying SMOTE, like below.

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
# Fit on the training text only, then transform both splits
vectorizer.fit(X_train.values.ravel())
X_train = vectorizer.transform(X_train.values.ravel())
X_test = vectorizer.transform(X_test.values.ravel())
X_train = X_train.toarray()
X_test = X_test.toarray()

and then add your SMOTE code

X_train = pd.DataFrame(X_train)
X_resampled, y_resampled = SMOTE().fit_resample(X_train, y_train)
  • what is the shape of X_train? – Ravi Dec 13 '20 at 22:32
  • I think the count vectorizer is not working as intended. how many data points you have in df ? – Ravi Dec 13 '20 at 22:40
  • thanks Ravi. However this approach, due to the toarray(), seems not working when I try to concatenate x_resampled and y_resampled into a unique dataset: TypeError: cannot concatenate object of type ''; only Series and DataFrame objs are valid – Math Dec 13 '20 at 23:04
  • add this line before smote calling " x_train = pd.DataFrame(X_train)" – Ravi Dec 13 '20 at 23:47
  • but this line before SMOTE should still use the not sampled data, should it not? Because I would need something that can consider as train set what I will use after SMOTE – Math Dec 13 '20 at 23:49
  • no all we are doing is converting from NumPy array to pandas dataframe. actually, the error occurred because of trying to use NumPy in pandas concatenation. – Ravi Dec 14 '20 at 00:56

You can use SMOTENC instead of SMOTE. SMOTENC deals with categorical variables directly.

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html#imblearn.over_sampling.SMOTENC


Tokenizing your string data before feeding it into SMOTE is another option. You can use any tokenizer; with torch, the implementation would look something like this (assuming `dataset` yields dicts with `input_ids` and `labels`, e.g. the output of a Hugging Face tokenizer):

import torch
from imblearn.over_sampling import SMOTE

dataloader = torch.utils.data.DataLoader(dataset, batch_size=64)

X, y = [], []

for batch in dataloader:
    input_ids = batch['input_ids']
    labels = batch['labels']

    X.append(input_ids)
    y.append(labels)

X_tensor = torch.cat(X, dim=0)
y_tensor = torch.cat(y, dim=0)

X = X_tensor.numpy()
y = y_tensor.numpy()

smote = SMOTE(random_state=42, sampling_strategy=0.6)
X_resampled, y_resampled = smote.fit_resample(X, y)