How to split train and test datasets into X_train, y_train and X_test, y_test?

Question

I successfully split my dataset into train and test sets in a 70:30 ratio using this:

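import numpy as np

# Adds a column of random values; the mask below does not use it
df_glass['split'] = np.random.randn(df_glass.shape[0], 1)
# Boolean mask that sends roughly 70% of the rows to the training set
msk = np.random.rand(len(df_glass)) <= 0.7
train = df_glass[msk]
test = df_glass[~msk]
print(train)
print(test)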

Now how do I split train and test into X_train, y_train, X_test and y_test, such that X denotes the features of the database and y denotes the response?

I need to do supervised learning and apply ML models to X_train and y_train.

My database looks like this: Database_snippet

score 3 · Answer 1 · answered Nov 16 '17 at 04:39

Scikit-learn has a convenience function, train_test_split, that works directly on pandas DataFrames. This will do the split:

# df[list_of_X_cols] holds the feature columns; df['y'] is the response column
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df[list_of_X_cols], df['y'], test_size=0.33, random_state=42)

Vivek Kalyanarangan


I am a beginner, so can you explain what I should pass as "list_of_X_cols"? – Gaurav Singh Nov 16 '17 at 04:47
The list of columns that you will treat as independent variables. It is basically a comma-separated list of column names in your data. – Vivek Kalyanarangan Nov 16 '17 at 04:51
Great! Thank you, Vivek – Gaurav Singh Nov 16 '17 at 05:00
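
To make the "list_of_X_cols" idea concrete, here is a minimal sketch. The DataFrame and the column names (f1 to f5, target) are made up for illustration and are not the asker's actual glass columns; replace them with the real feature and response column names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: five feature columns and one response column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 5)), columns=["f1", "f2", "f3", "f4", "f5"])
df["target"] = rng.integers(0, 3, size=20)

# Names of the columns to treat as features (assumed names).
list_of_X_cols = ["f1", "f2", "f3", "f4", "f5"]

# Splits rows into train/test and columns into X/y in one call.
X_train, X_test, y_train, y_test = train_test_split(
    df[list_of_X_cols], df["target"], test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

Passing a list of names to df[...] returns a DataFrame with only those columns, so X_train and X_test keep the feature names while y_train and y_test are Series holding the response.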

score 2 · Accepted Answer · answered Nov 16 '17 at 05:20

I guess you may find this useful for understanding:

import pandas as pd
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import the dataset
dataset = pd.read_csv('Salary_Data.csv')
x = dataset.iloc[:, :-1].values  # every column except the last (features)
y = dataset.iloc[:, 1].values    # column at index 1 (the response here)

# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=1/3, random_state=0)

Ariful Shuvo


Hi, can you help me understand the meaning of x = dataset.iloc[:, :-1].values and y = dataset.iloc[:, 1].values? According to my database, my features are in the first 5 columns and the last column is the response. – Gaurav Singh Nov 16 '17 at 05:37
A few tweaks here and there and it worked! Thanks – Gaurav Singh Nov 16 '17 at 06:05
iloc is integer-location based indexing, i.e. selection by position. My model was a simple linear regression with one independent variable, and I was splitting the data into x = "independent variable" and y = "dependent variable", following the linear equation y = mx + b. – Ariful Shuvo Nov 17 '17 at 04:03
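
For the layout described in the comments (features in the first five columns, response in the last one), the same positional indexing can be applied to train and test frames that were already split with a mask. This is only a sketch on made-up data; the column names are hypothetical, and it assumes the response really is the last column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: five feature columns followed by one response column.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((20, 5)), columns=["f1", "f2", "f3", "f4", "f5"])
df["target"] = rng.integers(0, 3, size=20)

# Row-wise split with a boolean mask, as in the question (roughly 70:30).
msk = rng.random(len(df)) <= 0.7
train, test = df[msk], df[~msk]

# Column-wise split by position: ":-1" is every column but the last, "-1" is the last.
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1]
X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1]

print(X_train.shape, y_train.shape)
```

Using -1 rather than a fixed index such as 1 for the response keeps the code correct when the number of feature columns changes. Append .values if a NumPy array is needed instead of a DataFrame or Series.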