I have data points in a SciPy CSR sparse matrix and labels in a pandas Series.

I want to down-sample the dataset.

I tried resampling the data points (matrix) and the labels (pandas Series) separately, using the same random state:

from sklearn.utils import resample

X4_train_undersampled = resample(X4_train, replace=False, n_samples=41615, random_state=123)
y_train_undersampled = resample(y_train, replace=False, n_samples=41615, random_state=123)

I want to know whether this is the right way to do it.

If yes, how can I test that the same rows are sampled from the data points and the labels?

If no, please suggest another way to do the down-sampling.
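
One way to sidestep the question of whether two separate calls stay aligned is to pass both objects to a single `resample` call, which applies one set of drawn indices to every array it receives. The sketch below is only an illustration on made-up toy data: `X`, `y`, and `positions` are hypothetical stand-ins for `X4_train`, `y_train`, and a helper array of row positions, and the sample size of 6 is arbitrary. Passing the positions array along makes it possible to check afterwards which rows were drawn.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.utils import resample

# Toy stand-ins for X4_train and y_train (hypothetical data, not the real set).
rng = np.random.default_rng(0)
X = csr_matrix(rng.integers(0, 5, size=(10, 4)))
y = pd.Series(np.arange(10) % 2, index=[f"row{i}" for i in range(10)])

# Helper array of row positions, resampled together with the data
# so the drawn rows can be identified afterwards.
positions = np.arange(X.shape[0])

# One call: the same drawn indices are applied to X, y, and positions.
X_sub, y_sub, pos_sub = resample(
    X, y, positions, replace=False, n_samples=6, random_state=123
)

# Check that the sampled rows and labels are exactly the originals
# at the drawn positions. (A != B).nnz == 0 compares sparse matrices
# without converting them to dense arrays.
same_rows = (X_sub != X[pos_sub]).nnz == 0
same_labels = y_sub.equals(y.iloc[pos_sub])
print(same_rows, same_labels)
```

The same positions trick can be used to test the two-call version from the question: resample `positions` with the same `random_state` and compare each result against `X[pos_sub]` and `y.iloc[pos_sub]`.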

desertnaut
sai
  • See https://stackoverflow.com/a/55361538/4333359. For `DataFrame.sample()` the only things that matter are the length of the DataFrame, the seed, and the value of `replace`. If those are all the same, the `iloc` indices it selects will be identical. So long as the DataFrames are indexed the same, it will select the same rows. – ALollz Sep 24 '19 at 15:45
  • And just to be a bit more clear, if you use `numpy.random.choice` to generate the indices (call them `idx`) (see the bottom part of my solution in the link) then you can be extremely transparent about selecting the exact same rows by doing `your_df.iloc[idx]` for each DataFrame you would like to resample. – ALollz Sep 24 '19 at 15:59
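
The explicit-index approach from the comment above could look roughly like this for a sparse matrix plus a Series. This is a minimal sketch on hypothetical toy data, not the linked answer's code: it uses NumPy's `Generator.choice` (the newer counterpart of `numpy.random.choice`), and the names `X`, `y`, and `idx` as well as the sample size are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy stand-ins for the real matrix and labels (hypothetical data).
X = csr_matrix(np.arange(40).reshape(10, 4))
y = pd.Series(np.arange(10) % 2)

# Draw the row positions once, without replacement.
rng = np.random.default_rng(123)
idx = rng.choice(X.shape[0], size=6, replace=False)

# Apply the same positions to both objects.
X_sub = X[idx]        # CSR matrices support row selection by integer array
y_sub = y.iloc[idx]   # positional selection, independent of the Series index
```

Because `idx` is an ordinary array, it can be printed, stored, or reused, which makes it easy to confirm that the features and labels come from the same rows.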

0 Answers