I have recently started studying neural networks and pandas DataFrames. My dataset is split across several .csv files, and I load the training data as follows:
import pandas as pd
df1 = pd.read_csv("/home/path/to/file/data1.csv")
df2 = pd.read_csv("/home/path/to/file/data2.csv")
df3 = pd.read_csv("/home/path/to/file/data3.csv")
df4 = pd.read_csv("/home/path/to/file/data4.csv")
df5 = pd.read_csv("/home/path/to/file/data5.csv")
trainDataset = pd.concat([df1, df2, df3, df4, df5])
Many articles suggest that the test dataset should be around 20% of the train dataset. My questions are:
- How can I define the test dataset to be 20% of the train dataset?
- When I load both the train and test datasets, what is the best way to randomize the data?
I tried this solution and wrote the following code, but it didn't work:
testDataset = train_test_split(trainDataset, test_size=0.2)
I would appreciate any tips or help on this matter.
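For reference, here is a minimal sketch of how such a split is commonly written with scikit-learn. It rests on a few assumptions: `train_test_split` is the one from `sklearn.model_selection`, the small DataFrame `full_dataset` is a hypothetical stand-in for the concatenated CSV data, and the column names and `random_state=42` are illustrative only. The point it shows is that `train_test_split` returns a pair (train, test), so the result has to be unpacked into two variables, and that it shuffles the rows by default.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the concatenated CSV data (100 rows).
full_dataset = pd.DataFrame(
    {"feature": range(100), "label": [i % 2 for i in range(100)]}
)

# train_test_split returns two objects: the train part and the test part.
# shuffle=True (the default) randomizes row order before splitting;
# random_state (an arbitrary value here) makes the shuffle repeatable.
train_dataset, test_dataset = train_test_split(
    full_dataset, test_size=0.2, shuffle=True, random_state=42
)

print(len(train_dataset), len(test_dataset))  # 80 20
```

Note that `test_size=0.2` takes 20% of all the rows passed in, so the test set ends up being 20% of the full data rather than 20% of the remaining train part.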