
I have a large dataset (around 200k rows) that I want to split randomly into two parts: 70% as training data and 30% as testing data. Is there a way to do this in Python? I also want to save both datasets as Excel or CSV files on my computer. Thanks!

huy
  • Load the data into pandas, and you can use `train_test_split` in sklearn to split the data according to your need – badhusha muhammed Jul 23 '20 at 12:50
  • Does this answer your question? [Train-test Split of a CSV file in Python](https://stackoverflow.com/questions/50040238/train-test-split-of-a-csv-file-in-python) – badhusha muhammed Jul 23 '20 at 12:51
  • hi..this clarifies splitting for me.. however i also wanted to know how i can save the entire training data (x_train, y_train) as a single csv file – huy Jul 23 '20 at 12:53
  • There are so many answers within SO for this particular question. Instead of searching for those, you had to create a new question. sigh! – kleerofski Jul 23 '20 at 12:57
  • @kleerofski sorry for the trouble...im new to python and SO in general – huy Jul 23 '20 at 13:26

2 Answers

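```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset ('data.csv' is a placeholder for your own file)
data = pd.read_csv('data.csv')

# Split the data into a 70% training set and a 30% test set
train, test = train_test_split(data, test_size=0.30, random_state=0)

# Save each part as a CSV file
train.to_csv('train.csv', index=False)
test.to_csv('test.csv', index=False)
```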
Rajat Agarwal

Start by importing the following:

```python
from sklearn.model_selection import train_test_split
import pandas as pd
```

To split the data, you can use the `train_test_split` function from the sklearn package:

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

where `X` (the features) and `y` (the target) are taken from your original dataframe.
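As a minimal sketch of how `X` and `y` might be taken from a dataframe: the toy data and the column name `target` below are assumptions for illustration only, so substitute your own dataframe and target column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataframe; "target" is an assumed name for the label column.
df = pd.DataFrame({
    "height": range(100),
    "weight": range(100, 200),
    "target": [i % 2 for i in range(100)],
})

X = df.drop(columns=["target"])  # every column except the label
y = df["target"]                 # the label column on its own

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# The feature and label pieces of each split keep matching row indices.
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```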

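Later, you can export each of them as CSV using the pandas package. Note that `to_csv` needs a file path; without one it only returns the CSV text as a string and writes nothing to disk:

```python
X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
```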

Same goes for y data as well.
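If the features and target have already been split separately and you want each split in a single file, one option is to join them back together with `pd.concat` before exporting. This is only a sketch: the toy data, the `label` column name, and the temporary output folder are all assumptions, so point the paths at your own folder.

```python
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; "label" is an assumed target column name.
df = pd.DataFrame({
    "a": range(100),
    "b": range(100, 200),
    "label": [i % 2 for i in range(100)],
})
X = df.drop(columns=["label"])
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Put features and target side by side; rows are aligned on their shared index.
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)

# A temporary folder is used here only to keep the sketch side-effect free.
out_dir = Path(tempfile.mkdtemp())
train_path = out_dir / "train.csv"
test_path = out_dir / "test.csv"
train.to_csv(train_path, index=False)
test.to_csv(test_path, index=False)
```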

EDIT: since you clarified that you need both the X and y columns in the same file, you can split the whole dataframe instead:

```python
train, test = train_test_split(yourdata, test_size=0.3, random_state=42)
```

and then export each part with `to_csv`, passing a file path, e.g. `train.to_csv('train.csv', index=False)`.
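A quick way to sanity-check a whole-dataframe split is sketched below. The 1000-row `yourdata` frame is a made-up stand-in for the real dataset, and the checks are just one possible way to confirm that no row is lost or duplicated.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the full dataset (features and target together).
yourdata = pd.DataFrame({
    "feature": range(1000),
    "target": [i % 3 for i in range(1000)],
})

train, test = train_test_split(yourdata, test_size=0.3, random_state=42)

# Every original row should land in exactly one of the two parts.
no_overlap = train.index.intersection(test.index).empty
all_rows_kept = len(train) + len(test) == len(yourdata)

print(len(train), len(test), no_overlap, all_rows_kept)
```

For Excel output instead of CSV, pandas also provides `DataFrame.to_excel` (for example `train.to_excel('train.xlsx', index=False)`), which needs an Excel writer engine such as openpyxl to be installed.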

My Koryto