resampling data - using SMOTE from imblearn with 3D numpy arrays

Question

I want to resample my dataset. It consists of categorically transformed data with labels from 3 classes. The number of samples per class is:

counts of class A: 6945
counts of class B: 650
counts of class C: 9066
Total samples: 16661

The data shape without labels is (16661, 1000, 256), that is, 16661 samples of shape (1000, 256). I would like to up-sample the data up to the number of samples in the majority class, that is, class A (6945).

However, when calling:

from imblearn.over_sampling import SMOTE
print(categorical_vector.shape)
sm = SMOTE(random_state=2)
X_train_res, y_labels_res = sm.fit_sample(categorical_vector, labels.ravel())

It keeps saying ValueError: Found array with dim 3. Estimator expected <= 2.

How can I flatten the data so that the estimator can fit it and the result still makes sense? Furthermore, how can I restore the 3D shape after getting X_train_res?

It is possible to convert a 3D array to 2D and then back to 3D again, but you need to know the 2D shape you require before you can proceed. – Abdur Rehman, May 14 '19 at 08:02

score 5 · Accepted Answer · answered May 14 '19 at 11:35

I am using a dummy 3D array and choosing the 2D shape myself:

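import numpy as np

# Dummy 3D array standing in for the real data
arr = np.random.rand(160, 10, 25)
orig_shape = arr.shape
print(orig_shape)

Output: (160, 10, 25)

# Collapse to 2D; the row count absorbs the third axis so the element count is unchanged
arr = np.reshape(arr, (arr.shape[0] * arr.shape[2], arr.shape[1]))
print(arr.shape)

Output: (4000, 10)

# Restore the original 3D shape
arr = np.reshape(arr, orig_shape)
print(arr.shape)

Output: (160, 10, 25)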

Abdur Rehman

@sanchezjAI you can apply this code to your own situation. Let me know if it worked. – Abdur Rehman May 14 '19 at 11:36

Gábor Kőrösi · Answer 2 · 2020-08-05T12:27:32.903

from imblearn.over_sampling import RandomOverSampler
import numpy as np

oversample = RandomOverSampler(sampling_strategy='minority')

X can be time-stepped 3D data indexed as X[sample, time, feature], with y holding one binary label per sample. For example, the sequence (1,1),(2,1),(3,1) has label 1.

X = np.array([[[1,1],[2,1],[3,1]],
             [[2,1],[3,1],[4,1]],
             [[5,1],[6,1],[7,1]],
             [[8,1],[9,1],[10,1]],
             [[11,1],[12,1],[13,1]]
             ])

y = np.array([1,0,1,1,0])

The oversampler cannot be fit on 3D X values, and if you pass it a 2D slice you get 2D data back:

Xo,yo = oversample.fit_resample(X[:,:,0], y)
Xo:
[[ 1  2  3]
 [ 2  3  4]
 [ 5  6  7]
 [ 8  9 10]
 [11 12 13]
 [ 2  3  4]]

yo:
[1 0 1 1 0 0]

However, fitting on the 2D slice X[:, :, 0] also stores the indices of the selected samples, and those indices are enough to build the 3D oversampled data:

oversample.fit_resample(X[:,:,0], y)
Xo = X[oversample.sample_indices_]
yo = y[oversample.sample_indices_]

Xo:
[[[ 1  1][ 2  1][ 3  1]]
 [[ 2  1][ 3  1][ 4  1]]
 [[ 5  1][ 6  1][ 7  1]]
 [[ 8  1][ 9  1][10  1]]
 [[11  1][12  1][13  1]]
 [[ 2  1][ 3  1][ 4  1]]]
yo:
[1 0 1 1 0 0]

score 1 · Answer 3 · answered Oct 31 '20 at 14:18

I flatten the data so that each time step becomes a row of a 2D array, apply SMOTE, and then reshape the result back into a 3D array. My script is below; if anything is unclear, please leave a comment.

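import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# train_dataset and test_dataset are sequences of (x, y) pairs
x_train, y_train = zip(*train_dataset)
x_test, y_test = zip(*test_dataset)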

dim_1 = np.array(x_train).shape[0]
dim_2 = np.array(x_train).shape[1]
dim_3 = np.array(x_train).shape[2]

new_dim = dim_1 * dim_2

new_x_train = np.array(x_train).reshape(new_dim, dim_3)


new_y_train = []
for i in range(len(y_train)):
    # print(y_train[i])
    new_y_train.extend([y_train[i]]*dim_2)

new_y_train = np.array(new_y_train)

# transform the dataset
oversample = SMOTE()
X_Train, Y_Train = oversample.fit_resample(new_x_train, new_y_train)
# summarize the new class distribution
counter = Counter(Y_Train)
print('The number of samples in TRAIN: ', counter)



x_train_SMOTE = X_Train.reshape(int(X_Train.shape[0]/dim_2), dim_2, dim_3)

y_train_SMOTE = []
for i in range(int(X_Train.shape[0]/dim_2)):
    # print(i)
    value_list = list(Y_Train.reshape(int(X_Train.shape[0]/dim_2), dim_2)[i])
    # print(list(set(value_list)))
    y_train_SMOTE.extend(list(set(value_list)))
    ## Check: if there is any different value in a list 
    if len(set(value_list)) != 1:
        print('\n\n********* STOP: THERE IS SOMETHING WRONG IN TRAIN ******\n\n')
    


dim_1 = np.array(x_test).shape[0]
dim_2 = np.array(x_test).shape[1]
dim_3 = np.array(x_test).shape[2]

new_dim = dim_1 * dim_2

new_x_test = np.array(x_test).reshape(new_dim, dim_3)


new_y_test = []
for i in range(len(y_test)):
    # print(y_train[i])
    new_y_test.extend([y_test[i]]*dim_2)

new_y_test = np.array(new_y_test)

# transform the dataset
oversample = SMOTE()
X_Test, Y_Test = oversample.fit_resample(new_x_test, new_y_test)
# summarize the new class distribution
counter = Counter(Y_Test)
print('The number of samples in TEST: ', counter)



x_test_SMOTE = X_Test.reshape(int(X_Test.shape[0]/dim_2), dim_2, dim_3)

y_test_SMOTE = []
for i in range(int(X_Test.shape[0]/dim_2)):
    # print(i)
    value_list = list(Y_Test.reshape(int(X_Test.shape[0]/dim_2), dim_2)[i])
    # print(list(set(value_list)))
    y_test_SMOTE.extend(list(set(value_list)))
    ## Check: if there is any different value in a list 
    if len(set(value_list)) != 1:
        print('\n\n********* STOP: THERE IS SOMETHING WRONG IN TEST ******\n\n')
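
The flatten and regroup bookkeeping in the script above can be written more compactly with NumPy alone. The following is a sketch rather than a replacement for the script: `flatten_steps` and `regroup_steps` are hypothetical helper names, SMOTE itself is left out, and the label check raises an error instead of printing a warning.

```python
import numpy as np

def flatten_steps(x, y):
    """(n, steps, feats) -> (n * steps, feats), repeating each label per step."""
    x = np.asarray(x)
    n, steps, feats = x.shape
    return x.reshape(n * steps, feats), np.repeat(np.asarray(y), steps)

def regroup_steps(x_flat, y_flat, steps):
    """Inverse of flatten_steps; raises if a regrouped sequence mixes labels."""
    if x_flat.shape[0] % steps != 0:
        raise ValueError("row count is not a multiple of the sequence length")
    x = x_flat.reshape(-1, steps, x_flat.shape[1])
    y_groups = y_flat.reshape(-1, steps)
    # Every row of y_groups must hold a single repeated label.
    if not (y_groups == y_groups[:, :1]).all():
        raise ValueError("a regrouped sequence contains more than one label")
    return x, y_groups[:, 0]

x = np.arange(4 * 3 * 2, dtype=float).reshape(4, 3, 2)
y = np.array([0, 1, 1, 2])
x_flat, y_flat = flatten_steps(x, y)
x_back, y_back = regroup_steps(x_flat, y_flat, steps=3)
print(x_flat.shape, y_flat)
```

When a resampler is applied between the two calls, rows synthesized per time step are not guaranteed to form coherent sequences once regrouped, so the label check is worth keeping.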
