I have a large dataset (around 200k rows). I want to split it randomly into two parts: 70% as training data and 30% as testing data. Is there a way to do this in Python? I would also like to save both datasets as Excel or CSV files on my computer. Thanks!
Viewed 9,403 times
0
-
Load the data into pandas, and you can use `train_test_split` in sklearn to split the data according to your need – badhusha muhammed Jul 23 '20 at 12:50
-
Does this answer your question? [Train-test Split of a CSV file in Python](https://stackoverflow.com/questions/50040238/train-test-split-of-a-csv-file-in-python) – badhusha muhammed Jul 23 '20 at 12:51
-
Hi, this clarifies splitting for me. However, I also wanted to know how I can save the entire training data (x_train, y_train) as a single CSV file – huy Jul 23 '20 at 12:53
-
There are so many answers within SO for this particular question. Instead of searching for those, you had to create a new question. sigh! – kleerofski Jul 23 '20 at 12:57
-
@kleerofski sorry for the trouble... I'm new to Python and SO in general – huy Jul 23 '20 at 13:26
2 Answers
4
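from sklearn.model_selection import train_test_split
# data is assumed to be a pandas DataFrame you have already loaded
# Split the data into train (70%) and test (30%) sets; rows are shuffled by default
train, test = train_test_split(data, test_size=0.30, random_state=0)
# Save each part as a CSV file in the current working directory
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)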
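In case it helps to see the whole thing run end to end, here is a minimal sketch. The DataFrame, its column names, and the temporary output folder are all made up for illustration; with real data you would load your own file instead of generating random numbers.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real 200k-row dataset.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "feature_a": rng.normal(size=1000),
    "feature_b": rng.integers(0, 10, size=1000),
    "label": rng.integers(0, 2, size=1000),
})

# Shuffle and split: 70% train, 30% test. random_state makes it repeatable.
train, test = train_test_split(data, test_size=0.30, random_state=0)

# Write both parts to a temporary folder (swap in any folder you like).
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")
train.to_csv(train_path, index=False)
test.to_csv(test_path, index=False)

# Read the files back to confirm nothing was lost on the way.
train_back = pd.read_csv(train_path)
test_back = pd.read_csv(test_path)
print(len(train_back), len(test_back))
```

For Excel output, `DataFrame.to_excel` is used the same way, but it needs an Excel writer package such as openpyxl installed.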

Rajat Agarwal
0
Start by importing the following:
from sklearn.model_selection import train_test_split
import pandas as pd
To split the data, you can use the train_test_split function from the sklearn package:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
where X (the feature columns) and y (the target column) are taken from your original DataFrame.
Later, you can export each of them as CSV using the pandas package:
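X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
The same goes for the y data. Note that to_csv needs a file path; without one it only returns the CSV text as a string and writes nothing to disk.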
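If you have already split into X and y and want the features and the target back together in one file, one option is to join them column-wise with `pd.concat`. This is only a sketch: the column names and the random data below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: two feature columns and a target column named "label".
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "feature_a": rng.normal(size=200),
    "feature_b": rng.normal(size=200),
    "label": rng.integers(0, 2, size=200),
})
X = df.drop(columns="label")
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# X_train and y_train keep the original row index, so a column-wise
# concat lines each feature row up with its own target value.
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)
```

After that, `train.to_csv('train.csv', index=False)` writes the combined training data to a single file.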
EDIT: since you clarified that you need both the X and y columns in the same file, you can split the whole DataFrame directly:
train, test = train_test_split(yourdata, test_size=0.3, random_state=42)
and then export train and test to CSV as mentioned above.
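As a side note, if you would rather not depend on sklearn for this step, a pandas-only version is possible. This is a different technique from train_test_split (it uses `DataFrame.sample` plus `drop`), and it assumes the DataFrame index has no duplicate labels. The data here is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data with the default (unique) integer index.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "x": rng.normal(size=500),
    "y": rng.integers(0, 2, size=500),
})

# Randomly pick 70% of the rows for training.
train = data.sample(frac=0.7, random_state=42)

# Everything that was not sampled becomes the test set.
test = data.drop(train.index)
```

Both results are ordinary DataFrames, so they can be saved with to_csv in the usual way.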

My Koryto
-
The train and test data were a single file before, and we split it, and now you need it in a single file again? I don't understand what you are trying to do. – badhusha muhammed Jul 23 '20 at 12:54