
I am trying to reproduce the behavior of R's createDataPartition function in Python. I have a machine learning dataset with a boolean target variable, and I would like to split it into a training set (60%) and a testing set (40%).

If I split it completely at random, my target variable won't be evenly distributed between the two sets.

I achieve it in R using:

inTrain <- createDataPartition(y=data$repeater, p=0.6, list=F)
training <- data[inTrain,]
testing <- data[-inTrain,]

How can I do the same in Python?

P.S.: I am using scikit-learn as my machine learning library, along with pandas.

poiuytrez

4 Answers


scikit-learn provides the train_test_split function:

from sklearn.model_selection import train_test_split
from sklearn import datasets

# Use Age and Weight to predict a value for the food someone chooses
X_train, X_test, y_train, y_test = train_test_split(table[['Age', 'Weight']],
                                                    table['Food Choice'],
                                                    test_size=0.25)

# Another example using the sklearn pre-loaded datasets:
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
X, y = X_iris[:, :2], y_iris
X_train, X_test, y_train, y_test = train_test_split(X, y)

This breaks the data into

  • inputs for training
  • inputs for the evaluation data
  • output for the training data
  • output for the evaluation data

respectively. The test_size keyword argument (as in test_size=0.25 above) controls what fraction of the data is held out for testing.

To split a single dataset, you can use a call like this to get 40% test data:

>>> import numpy as np
>>> data = np.arange(700).reshape((100, 7))
>>> training, testing = train_test_split(data, test_size=0.4)
>>> print(len(data))
100
>>> print(len(training))
60
>>> print(len(testing))
40
Noel Evans

The correct answer is sklearn.model_selection.StratifiedShuffleSplit

Stratified ShuffleSplit cross-validator

Provides train/test indices to split data into train/test sets.

This cross-validation object is a merge of StratifiedKFold and ShuffleSplit, which returns stratified randomized folds. The folds are made by preserving the percentage of samples for each class.

Note: like the ShuffleSplit strategy, stratified random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
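
Since the quoted docs don't include code, here is a minimal sketch of how it might be used; the feature matrix X and imbalanced binary target y are invented for illustration, with the 40% test size from the question:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(100, 3)          # hypothetical feature matrix
y = np.array([0] * 70 + [1] * 30)   # imbalanced binary target

# One stratified 60/40 split, as in the question
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_idx, test_idx = next(sss.split(X, y))

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Both subsets keep the original 70/30 class ratio
print(np.bincount(y_train))  # [42 18]
print(np.bincount(y_test))   # [28 12]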

Pbeirao

The answer provided is not correct. Apparently there is no single function in Python that performs stratified (rather than purely random) sampling the way createDataPartition does in R.


As mentioned in the comments, the selected answer does not preserve the class distribution of the data. The scikit-learn docs point out that StratifiedShuffleSplit should be used if this is required. The same effect can be achieved with the train_test_split method by passing your target array to the stratify option.

>>> import numpy as np
>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split

>>> X, y = datasets.load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)

>>> # show counts of each type after split
>>> print(np.unique(y, return_counts=True))
(array([0, 1, 2]), array([50, 50, 50], dtype=int64))
>>> print(np.unique(y_test, return_counts=True))
(array([0, 1, 2]), array([16, 17, 17], dtype=int64))
>>> print(np.unique(y_train, return_counts=True))
(array([0, 1, 2]), array([34, 33, 33], dtype=int64))
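
To map this back to the question's setting (a pandas DataFrame with a boolean repeater column and a 60/40 split), a sketch might look like the following; the DataFrame here is fabricated for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the question's DataFrame
data = pd.DataFrame({
    'age': range(100),
    'repeater': [True] * 30 + [False] * 70,
})

# 60% training / 40% testing, preserving the True/False ratio of 'repeater'
training, testing = train_test_split(
    data, test_size=0.4, stratify=data['repeater'], random_state=42)

print(training['repeater'].mean())  # 0.3
print(testing['repeater'].mean())   # 0.3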
Zachary Luety