Questions tagged [train-test-split]

Questions with this tag are about how to split a machine learning data set into random train and test subsets.

In particular, questions with this tag often ask how to split data with scikit-learn, where the train_test_split helper function quickly produces a random split into training and test sets.

Reference: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
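A minimal sketch of the helper described above. The toy arrays and the chosen test_size are made up for illustration and are not taken from any question on this page.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 20 samples with 2 features each.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold out 25% of the rows for testing; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```

The function shuffles by default, so rows in each subset are not in their original order.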

428 questions
129 votes, 13 answers

Keras split train test set when using ImageDataGenerator

I have a single directory which contains sub-folders (according to labels) of images. I want to split this data into train and test sets while using ImageDataGenerator in Keras. Although model.fit() in Keras has the argument validation_split for…
Nitin
77 votes, 4 answers

Normalize data before or after split of training and testing data?

I want to separate my data into train and test sets. Should I apply normalization to the data before or after the split? Does it make any difference when building a predictive model?
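One common way to approach this, sketched under assumptions: split first, fit the scaler on the training rows only, then reuse those statistics on the test rows so nothing from the test set influences preprocessing. The random data and the choice of StandardScaler are illustrative, not taken from the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std come from train only
X_test_scaled = scaler.transform(X_test)        # test reuses the train statistics
```

The scaled test set will generally not have exactly zero mean, which is expected with this approach.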
30 votes, 1 answer

How to generate a train-test-split based on a group id?

I have the following data: pd.DataFrame({'Group_ID':[1,1,1,2,2,2,3,4,5,5], 'Item_id':[1,2,3,4,5,6,7,8,9,10], 'Target': [0,0,1,0,1,1,0,0,0,1]}) Group_ID Item_id Target 0 1 1 0 1 1 2 0 2 …
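A possible sketch for the data shown in the question, assuming scikit-learn's GroupShuffleSplit is an acceptable tool here. The test_size and random_state values are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({
    "Group_ID": [1, 1, 1, 2, 2, 2, 3, 4, 5, 5],
    "Item_id": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Target": [0, 0, 1, 0, 1, 1, 0, 0, 0, 1],
})

# test_size here is a fraction of the groups, not of the rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["Group_ID"]))

train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Every group ends up entirely in one subset.
print(sorted(train_df["Group_ID"].unique()), sorted(test_df["Group_ID"].unique()))
```

Because whole groups are assigned, the row-level train/test ratio depends on the group sizes.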
30 votes, 3 answers

How to perform k-fold cross validation with tensorflow?

I am following the Iris example of TensorFlow. In my case, I have all the data in a single CSV file, not separated, and I want to apply k-fold cross validation on that data. I have data_set =…
mommomonthewind
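A framework-agnostic sketch of k-fold index generation, assuming the CSV has already been loaded into arrays. It uses scikit-learn's KFold rather than any TensorFlow API; the toy arrays and the five folds are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in for data loaded from a single CSV file (invented values).
X = np.arange(30).reshape(15, 2)
y = np.arange(15) % 3

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]
    # A fresh model (for example a TensorFlow one) would be built,
    # trained on the train part and evaluated on the validation part here.
    fold_sizes.append((len(train_idx), len(val_idx)))

print(fold_sizes)
```

Building a new model inside the loop keeps weights from one fold from carrying over into the next.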
28 votes, 2 answers

Should Feature Selection be done before Train-Test Split or after?

There are two contradictory answers to this question. The conventional answer is to do it after splitting, since doing it before can leak information from the test set. The contradicting answer is…
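One way to follow the "after splitting" answer, sketched under assumptions: put the selector in a pipeline so it is fitted on training rows only. The synthetic data, SelectKBest with k=5, and the logistic regression model are illustrative choices, not part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic data, invented for illustration.
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The selector only ever sees the training rows during fit.
model = make_pipeline(
    SelectKBest(f_classif, k=5),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
selected = model.named_steps["selectkbest"].get_support()
print(selected.sum(), test_accuracy)
```

The same pipeline object can be passed to cross-validation helpers, which then repeat the selection inside each fold.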
27 votes, 7 answers

Singleton array array(, dtype=object) cannot be considered a valid collection

Not sure how to fix . Any help is much appreciated. I saw this Vectorization: Not a valid collection but I am not sure if I understood it. train = df1.iloc[:,[4,6]] target = df1.iloc[:,[0]] def train(classifier, X, y): X_train, X_test, y_train, y_test =…
manisha
21 votes, 4 answers

Spark train test split

I am curious if there is something similar to sklearn's http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html for apache-spark in the latest 2.0.1 release. So far I could only find…
Georg Heiler
20 votes, 1 answer

Do I have to do one-hot-encoding separately for train and test dataset?

I'm working on a classification problem and I've split my data into train and test sets. I have a few categorical columns (around 4-6) and I am thinking of using pd.get_dummies to convert my categorical values to OneHotEncoding. My question is do I…
Jeeth
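A sketch of one alternative to calling pd.get_dummies on each subset separately, assuming scikit-learn's OneHotEncoder is acceptable: fit the encoder on the training data and reuse it on the test data so both share the same columns. The "color" column and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column; "purple" appears only in the test set.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"color": ["green", "purple"]})

# handle_unknown="ignore" encodes unseen categories as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["color"]]).toarray()
test_encoded = encoder.transform(test[["color"]]).toarray()

print(encoder.categories_)
print(test_encoded)
```

Encoding each subset independently can yield different column sets when a category is missing from one of them, which this approach avoids.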
18 votes, 4 answers

Splitting data using time-based splitting in test and train datasets

I know that train_test_split splits it randomly, but I need to know how to split it based on time. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42) # this splits the data randomly as 67% train and 33%…
dhruv bhardwaj
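A minimal sketch of a chronological split, assuming the data has a timestamp column to sort by. The column names, dates, and the 67/33 cut are illustrative; the question's actual data is not shown.

```python
import pandas as pd

# Hypothetical time-stamped data, invented for illustration.
df = pd.DataFrame({
    "timestamp": pd.date_range("2021-01-01", periods=12, freq="D"),
    "feature": range(12),
    "target": [0, 1] * 6,
})

# Sort by time, then cut once: earlier rows train, later rows test.
df = df.sort_values("timestamp").reset_index(drop=True)
split_point = int(len(df) * 0.67)

train_df = df.iloc[:split_point]
test_df = df.iloc[split_point:]

print(train_df["timestamp"].max(), test_df["timestamp"].min())
```

Passing shuffle=False to train_test_split on already sorted data gives a similar result, since the test rows are then taken from the end.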
17 votes, 11 answers

scikit-learn error: The least populated class in y has only 1 member

I'm trying to split my dataset into a training and a test set by using the train_test_split function from scikit-learn, but I'm getting this error: In [1]: y.iloc[:,0].value_counts() Out[1]: M2 38 M1 35 M4 29 M5 15 M0 15 M3 15 In…
Aurora
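One situation that raises this error is a stratified split where some class has a single member. The sketch below assumes that case and drops such classes before splitting; the labels and counts are invented, and the cause in the question above may be different since its excerpt is truncated.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: class "M2" has only one member.
df = pd.DataFrame({
    "feature": range(11),
    "label": ["M0"] * 5 + ["M1"] * 5 + ["M2"],
})

# Keep only classes with at least 2 members, the minimum for stratifying.
counts = df["label"].value_counts()
filtered = df[df["label"].isin(counts[counts >= 2].index)]

train_df, test_df = train_test_split(
    filtered, test_size=0.4, stratify=filtered["label"], random_state=0
)

print(train_df["label"].value_counts().to_dict())
print(test_df["label"].value_counts().to_dict())
```

Dropping rare classes changes the problem being modelled, so merging them into an "other" class or splitting without stratify may suit some data better.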
15 votes, 8 answers

Split image dataset into train-test datasets

So I have a main folder which contains sub-folders, which in turn contain images for the dataset as…
Ishan Dixit
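A standard-library sketch of splitting a folder-per-class image directory, under stated assumptions: the layout source/<label>/<file> and the output layout output/{train,test}/<label>/ are hypothetical, as is the helper name split_image_folders. The demo runs on empty placeholder files in a temporary directory.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_image_folders(source_dir, output_dir, test_fraction=0.2, seed=0):
    """Copy files from source_dir/<label>/ into output_dir/{train,test}/<label>/."""
    rng = random.Random(seed)
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    for class_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        files = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(files)
        n_test_files = int(len(files) * test_fraction)
        for i, path in enumerate(files):
            subset = "test" if i < n_test_files else "train"
            target = output_dir / subset / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target / path.name)


# Demo on a throwaway directory with two hypothetical classes of 10 files each.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "images"
    for label in ("cats", "dogs"):
        (src / label).mkdir(parents=True)
        for i in range(10):
            (src / label / f"{label}_{i}.jpg").touch()
    out = Path(tmp) / "split"
    split_image_folders(src, out)
    train_count = len(list((out / "train").rglob("*.jpg")))
    test_count = len(list((out / "test").rglob("*.jpg")))

print(train_count, test_count)
```

Splitting inside each class folder keeps the class proportions roughly equal between the two subsets.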
14 votes, 2 answers

Setting seed on train_test_split sklearn python

Is there any way to set a seed for train_test_split in Python's sklearn? I have set the parameter random_state to an integer, but I still cannot reproduce the result. Thanks in advance.
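A small check of what random_state does, assuming the same input data and the same library version on both calls. The arrays and the seed value are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration.
X = np.arange(50).reshape(25, 2)
y = np.arange(25)

# Two calls with the same integer seed on the same data.
first = train_test_split(X, y, test_size=0.2, random_state=123)
second = train_test_split(X, y, test_size=0.2, random_state=123)

same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)
```

If results still differ between runs, the randomness may come from a later step, such as a model that has its own random_state, rather than from the split.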
11 votes, 1 answer

How to split data based on a column value in sklearn

I have a data file with the following columns: 'customer', 'calibrat' - Calibration sample = 1; Validation sample = 0; 'churn', 'churndep', 'revenue', 'mou'. The data file contains some 40000 rows, of which 20000 have calibrat set to 1. I want to…
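A sketch of splitting on an indicator column with a boolean mask in pandas, no scikit-learn call needed. The rows are invented, and treating 'churn' as the target and 'revenue' as the only feature is an assumption made for illustration.

```python
import pandas as pd

# Invented rows using a subset of the column names from the question.
df = pd.DataFrame({
    "customer": [101, 102, 103, 104, 105, 106],
    "calibrat": [1, 0, 1, 1, 0, 0],
    "churn": [0, 1, 0, 1, 0, 1],
    "revenue": [50.0, 20.5, 33.1, 70.2, 15.0, 42.7],
})

# calibrat == 1 marks the calibration (training) sample, 0 the validation sample.
train_df = df[df["calibrat"] == 1]
valid_df = df[df["calibrat"] == 0]

feature_cols = ["revenue"]  # assumed feature set
X_train, y_train = train_df[feature_cols], train_df["churn"]
X_valid, y_valid = valid_df[feature_cols], valid_df["churn"]

print(len(train_df), len(valid_df))
```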
10 votes, 2 answers

Stratified Train/Validation/Test-split in scikit-learn

There is already a description here of how to do a stratified train/test split in scikit-learn via train_test_split (Stratified Train/Test-split in scikit-learn) and a description of how to do a random train/validation/test split via np.split (How to split data…
blu
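One possible sketch, assuming two consecutive calls to train_test_split with stratify are acceptable. The 60/20/20 proportions and the balanced toy labels are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: 100 samples, two balanced classes.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 50 + [1] * 50)

# First cut: 20% of all rows become the test set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Second cut: 25% of the remaining 80% is 20% of the total, used for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The second test_size is relative to the remainder after the first cut, which is why it differs from the first.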
8 votes, 0 answers

Managing Train/Develop Splits with the spaCy command line trainer

I am training an NER model using the python -m spacy train command line tool. I use gold.docs_to_json to convert my annotated documents to the JSON-serializable format. The command line training tool uses both a training set and a development set.…
W.P. McNeill