Sampling data such that distribution is preserved

Question

vsample_data = credit_card.sample(n=100, replace=False)

print(vsample_data)

I am trying to sample 100 data points from a data set, but the sample I get does not preserve the original class distribution of the credit card fraud data set, i.e. Class-0 (Non-Fraud) and Class-1 (Fraud).

Look at using `sklearn.model_selection.train_test_split` with `stratify=Y`: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn-model-selection-train-test-split — Scott Boston, Apr 24 '18 at 15:36
My random samples sometimes contain only Class-0 points, and only rarely include any Class-1 points — Dhruv Bhardwaj, Apr 24 '18 at 15:36
Is there a one-liner for this, so that I get both classes with the same distribution as the original data? — Dhruv Bhardwaj, Apr 24 '18 at 15:38
Possible duplicate of [Stratified samples from Pandas](https://stackoverflow.com/questions/41035187/stratified-samples-from-pandas) — fabianegli, Apr 24 '18 at 15:50
This might help: https://stackoverflow.com/a/41036118/6018688 — fabianegli, Apr 24 '18 at 15:53
I just need to sample the data so that the sample's distribution is the same as the original data's. One more thing: the data is highly imbalanced — Dhruv Bhardwaj, Apr 24 '18 at 16:11

score 0 · Answer 1 · answered Apr 24 '18 at 22:15

Increase your sample size (n >> 100). The data you are sampling from is itself a random sample, and creating a subset through random selection is itself a random process. If one of the classes has a low frequency, a sample of 100 is too small to represent it reliably.
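
To get a feel for how sample size matters, here is a minimal sketch that computes the chance of a sample containing no minority-class rows at all. The fraud rate of 0.2% is a made-up value for illustration (the question does not state the real one), and the helper `prob_no_minority` is hypothetical. It treats draws as independent, which is a reasonable approximation when the data set is much larger than the sample.

```python
# Hypothetical fraud rate; the question does not state the real one.
fraud_rate = 0.002


def prob_no_minority(n, rate):
    """Probability that n independent draws contain no minority-class row."""
    return (1.0 - rate) ** n


# Compare a few sample sizes under the assumed rate.
for n in (100, 1_000, 10_000):
    print(n, prob_no_minority(n, fraud_rate))
```

Under that assumed rate, most samples of 100 would contain no fraud rows, while much larger samples almost always contain some.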

If you change the replace flag to True and draw repeated samples, you are doing something called bootstrapping. Assuming the complete data set represents the true population distribution, this resampling gives you examples of the kinds of measurements you might get at small sample sizes (n=100).
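
A rough sketch of that resampling idea, using a synthetic stand-in for the data: the `credit_card` frame built here, its 0.5% fraud rate, and the helper `bootstrap_fraud_counts` are all assumptions for illustration, not the asker's real data or code.

```python
import pandas as pd

# Synthetic stand-in for the credit card data: 10,000 rows, 50 of them fraud.
credit_card = pd.DataFrame({"Class": [1] * 50 + [0] * 9_950})


def bootstrap_fraud_counts(df, n=100, repeats=1_000, seed=0):
    """Draw `repeats` samples of size n with replacement; return fraud counts."""
    counts = []
    for i in range(repeats):
        # A different integer seed per repeat keeps the run reproducible.
        sample = df.sample(n=n, replace=True, random_state=seed + i)
        counts.append(int(sample["Class"].sum()))
    return pd.Series(counts)


counts = bootstrap_fraud_counts(credit_card)

# How often each fraud count (0, 1, 2, ...) shows up across the resamples.
print(counts.value_counts().sort_index())
```

The spread of counts shows how much the number of fraud rows varies from one sample of 100 to the next.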

The alternative is a stratification strategy, as suggested in the comments. However, the subsets you create this way are no longer simple random samples, and the assumed distribution is now built into your smaller data sets. Note that you can only do this after looking at the entire data set to determine its distribution. Probably not what you want.
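
For reference, one way stratification might look in pandas is to sample the same fraction from each class with `groupby(...).sample(...)` (available in pandas 1.1 and later). The `credit_card` frame, its `Amount` column, and the 2% fraud rate below are synthetic assumptions for illustration only.

```python
import pandas as pd

# Synthetic stand-in: 10,000 rows, 2% of them fraud (Class 1).
credit_card = pd.DataFrame({
    "Amount": range(10_000),
    "Class": [1] * 200 + [0] * 9_800,
})

# Sample 1% of each class separately, so the class proportions are preserved.
stratified = credit_card.groupby("Class").sample(frac=0.01, random_state=0)

print(stratified["Class"].value_counts())
print(stratified["Class"].mean(), credit_card["Class"].mean())
```

Note that the per-class size is rounded, so a very rare class can end up with zero rows if `frac` times its size is below one half. The `train_test_split(..., stratify=...)` route mentioned in the comments is another option.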

If you are creating a (supervised) training data set from the data, you can repeat under-represented data to manipulate the bias.
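
A minimal sketch of repeating the under-represented class, assuming a synthetic `train` frame with a `Class` label column (both made up for illustration). Balancing the classes exactly 50/50 is just one possible target, not a recommendation.

```python
import pandas as pd

# Synthetic stand-in for a training set: 1,000 rows, 20 of them fraud.
train = pd.DataFrame({"Amount": range(1_000), "Class": [1] * 20 + [0] * 980})

minority = train[train["Class"] == 1]
majority = train[train["Class"] == 0]

# Repeat minority rows (sampling with replacement) until the classes match.
oversampled_minority = minority.sample(
    n=len(majority), replace=True, random_state=0
)

# Combine and shuffle so the repeated rows are not grouped together.
balanced = pd.concat([majority, oversampled_minority]).sample(
    frac=1, random_state=0
)

print(balanced["Class"].value_counts())
```

This deliberately changes the class distribution, so it is the opposite of what the question asks for and usually only makes sense for the training portion of the data.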
