
I'm currently studying neural networks and pandas DataFrames. The dataset I have is split into several .csv files, and for the training dataset I load them as follows:

import pandas as pd

df1 = pd.read_csv("/home/path/to/file/data1.csv")
df2 = pd.read_csv("/home/path/to/file/data2.csv")
df3 = pd.read_csv("/home/path/to/file/data3.csv")
df4 = pd.read_csv("/home/path/to/file/data4.csv")
df5 = pd.read_csv("/home/path/to/file/data5.csv")

trainDataset = pd.concat([df1, df2, df3, df4, df5])
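(One detail I noticed while doing this: `pd.concat` keeps each file's original row labels by default, so the combined index repeats. A small sketch of the fix, using toy frames in place of the CSV files:)

```python
import pandas as pd

# Two tiny frames standing in for the CSV chunks.
df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

# Without ignore_index the row labels repeat (0, 1, 0, 1);
# ignore_index=True rebuilds a clean 0..n-1 index after stacking.
trainDataset = pd.concat([df1, df2], ignore_index=True)
print(list(trainDataset.index))  # [0, 1, 2, 3]
```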

Then, as suggested by many articles, the test dataset should be around 20% of the size of the training dataset. My questions are:

  1. How can I define the test dataset to be 20% of the train dataset?
  2. When I load both the train and test datasets, what is the best way to randomize the data?

I tried this solution and wrote the following code, but it didn't work:

testDataset = train_test_split(trainDataset, test_size=0.2)

I appreciate any tips or help for this matter.

saadh17

1 Answer


The function train_test_split will give you the answer, but I'm a bit surprised by the call in your example.

It is more common to have something like this, with x being the features (the x in y=f(x), where f is the real function you are trying to approximate with your learning) and y being the responses (the y in y=f(x)):

from sklearn.model_selection import train_test_split
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2)

For more explanations, please see https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation
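That said, if your labels are still a column inside the DataFrame, `train_test_split` also accepts a single object; it shuffles the rows by default and returns two frames, which covers both of your questions at once. A minimal sketch with a toy frame standing in for `trainDataset`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for trainDataset.
df = pd.DataFrame({"x": range(10), "y": range(10)})

# train_test_split returns TWO frames; unpack both.
# Rows are shuffled by default (shuffle=True);
# random_state makes the split reproducible.
trainDf, testDf = train_test_split(df, test_size=0.2, random_state=42)

print(len(trainDf), len(testDf))  # 8 2
```

Your original call "didn't work" because it assigns both returned pieces to a single name instead of unpacking them.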

Michael Hooreman
  • No, `y` is not needed in `train_test_split`. The way the OP is using it is valid. The problem is not the way he is using it. – mujjiga Jun 23 '20 at 09:30
  • `train_test_split(data, test_size=0.2, random_state=42)`: add `random_state` for a fixed random shuffle – Tserenjamts Jun 23 '20 at 09:30
  • This will randomize the test dataset; is it possible to randomize the train dataset as well? – saadh17 Jun 23 '20 at 13:25
  • @mujjiga Indeed, but it's easier to understand that way, given the (obvious) hypothesis that this is supervised learning – Michael Hooreman Jun 23 '20 at 16:01
  • @saadh17 It creates a dedicated test set and training set. If you have another test set, you can use it for cross-validation, for example to compare different learnings (algorithms, hyperparameters, etc.) – Michael Hooreman Jun 23 '20 at 16:03
  • Regarding random_state, it depends on whether you want exactly the same selection (reproducible research) or not. – Michael Hooreman Jun 23 '20 at 16:06
  • Given the fact that it is randomization, I think it is more honest to keep it "out of control" ... otherwise good results might be due to an (improbable) set of lucky situations. If you keep it random, you rely only on the quality of the learning algorithm, not on a lucky set of examples. Also, repeating it with pure randomness can give an idea of the variance of the evaluation. – Michael Hooreman Jun 23 '20 at 16:10
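The `random_state` trade-off discussed in the comments can be seen directly: fixing the seed pins the shuffle so every run produces the identical split, while omitting it reseeds on each call. A small sketch, assuming scikit-learn is installed:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100)})

# Same random_state -> the exact same rows land in the test set
# on every call (reproducible research).
a_train, a_test = train_test_split(df, test_size=0.2, random_state=42)
b_train, b_test = train_test_split(df, test_size=0.2, random_state=42)
print(list(a_test.index) == list(b_test.index))  # True
```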