Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning dataset into random train and test subsets.

In particular, questions with this tag may aim to better understand how to split data with scikit-learn. In scikit-learn, a random split into training and test sets can be computed quickly with the train_test_split helper function.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
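
As a minimal sketch of the helper described above (the Iris dataset, the 80/20 ratio, and the seed value are illustrative assumptions, not part of the tag description), a reproducible stratified split might look like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Illustrative data only: Iris has 150 samples, 4 features, 3 balanced classes.
X, y = load_iris(return_X_y=True)

# test_size=0.2 holds out 20% of the rows for testing.
# random_state fixes the shuffle so the split is reproducible.
# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

Note that test_size is the fraction held out for testing, so the training set receives the remaining rows.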

428 questions

129

votes

13 answers

Keras split train test set when using ImageDataGenerator

I have a single directory which contains sub-folders (according to labels) of images. I want to split this data into train and test sets while using ImageDataGenerator in Keras. Although model.fit() in Keras has an argument validation_split for…

asked Feb 24 '17 at 16:43

Nitin

votes

4 answers

Normalize data before or after split of training and testing data?

I want to separate my data into train and test sets. Should I apply normalization to the data before or after the split? Does it make any difference when building a predictive model?

machine-learning data-science normalization training-data train-test-split

asked Mar 23 '18 at 07:13

hemant

votes

1 answer

How to generate a train-test-split based on a group id?

I have the following data: pd.DataFrame({'Group_ID':[1,1,1,2,2,2,3,4,5,5], 'Item_id':[1,2,3,4,5,6,7,8,9,10], 'Target': [0,0,1,0,1,1,0,0,0,1]}) Group_ID Item_id Target 0 1 1 0 1 1 2 0 2 …

python-3.x pandas machine-learning grouping train-test-split

asked Feb 21 '19 at 00:45

Negative Correlation

votes

3 answers

How to perform k-fold cross validation with tensorflow?

I am following the Iris example from TensorFlow. In my case, all the data is in a single CSV file, not separated, and I want to apply k-fold cross validation to that data. I have data_set =…

python tensorflow cross-validation train-test-split

asked Sep 28 '16 at 13:15

mommomonthewind

votes

2 answers

Should Feature Selection be done before Train-Test Split or after?

There are two contradictory answers to this question. The conventional answer is to do it after splitting, because doing it before can leak information from the test set. The contradicting answer is…

machine-learning feature-selection train-test-split

asked May 25 '19 at 19:38

Navoneel Chakrabarty

votes

7 answers

Singleton array array(, dtype=object) cannot be considered a valid collection

Not sure how to fix this. Any help is much appreciated. I saw this: Vectorization: Not a valid collection, but I am not sure I understood it. train = df1.iloc[:,[4,6]] target = df1.iloc[:,[0]] def train(classifier, X, y): X_train, X_test, y_train, y_test =…

python pandas scikit-learn pipeline train-test-split

asked Apr 05 '17 at 05:54

manisha

votes

4 answers

Spark train test split

I am curious if there is something similar to sklearn's http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html for apache-spark in the latest 2.0.1 release. So far I could only find…

apache-spark apache-spark-mllib train-test-split

asked Oct 12 '16 at 09:02

Georg Heiler

votes

1 answer

Do I have to do one-hot-encoding separately for train and test dataset?

I'm working on a classification problem and I've split my data into train and test sets. I have a few categorical columns (around 4-6) and I am thinking of using pd.get_dummies to one-hot encode my categorical values. My question is do I…

python machine-learning one-hot-encoding train-test-split

asked Apr 04 '19 at 21:29

Jeeth

votes

4 answers

Splitting data using time-based splitting in test and train datasets

I know that train_test_split splits the data randomly, but I need to know how to split it based on time. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) # this splits the data randomly as 67% train and 33%…

python scikit-learn timestamp train-test-split

asked Jun 15 '18 at 17:00

dhruv bhardwaj

votes

11 answers

scikit-learn error: The least populated class in y has only 1 member

I'm trying to split my dataset into a training and a test set by using the train_test_split function from scikit-learn, but I'm getting this error: In [1]: y.iloc[:,0].value_counts() Out[1]: M2 38 M1 35 M4 29 M5 15 M0 15 M3 15 In…

python scikit-learn train-test-split

asked Apr 03 '17 at 08:00

Aurora

votes

8 answers

Split image dataset into train-test datasets

So I have a main folder which contains sub-folders, which in turn contain images for the dataset as…

python-3.x training-data train-test-split

asked Aug 07 '19 at 12:05

Ishan Dixit

votes

2 answers

Setting seed on train_test_split sklearn python

Is there any way to set a seed for train_test_split in Python's sklearn? I have set the parameter random_state to an integer, but I still cannot reproduce the result. Thanks in advance.

python-3.x scikit-learn jupyter-notebook train-test-split

asked May 16 '19 at 10:12

Bernando Purba

votes

1 answer

How to split data based on a column value in sklearn

I have a data file with the following columns: 'customer', 'calibrat' (calibration sample = 1; validation sample = 0), 'churn', 'churndep', 'revenue', 'mou'. The data file contains some 40000 rows, of which 20000 have calibrat set to 1. I want to…

python machine-learning logistic-regression train-test-split smote

asked Apr 09 '20 at 06:56

Guest

votes

2 answers

Stratified Train/Validation/Test-split in scikit-learn

There is already a description here of how to do a stratified train/test split in scikit via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of how to do a random train/validation/test split via np.split (How to split data…

python scikit-learn train-test-split

asked Nov 27 '16 at 12:49

blu

votes

0 answers

Managing Train/Develop Splits with the spaCy command line trainer

I am training an NER model using the python -m spacy train command line tool. I use gold.docs_to_json to convert my annotated documents to the JSON-serializable format. The command line training tool uses both a training set and a development set.…

command-line-interface spacy train-test-split

asked Jan 26 '20 at 18:40

W.P. McNeill
