
I need to split a pandas DataFrame that I read from a CSV file into three groups: training, test, and validation. My problem is that I don't know how many attributes the CSV has, because I'm working with many datasets with different numbers of attributes (some have 3 or 4, others have 40+). I need to split it into these parts:

  • Training = 50%
  • Test = 25%
  • Validation = 25%

So if I have 5 attributes with 100 values each, I need to get 50 rows for training. How can I split the data across all attributes so that I end up with a new DataFrame for each group, always keeping the right proportions? I have already implemented the function to read the CSV. As you can see, it is generic: it only receives the path to the CSV and returns a new DataFrame from it.

import pandas as pd


class Entity:

    def __init__(self, path):
        self.data_frame = pd.read_csv(path)

    def get_value(self, attr):
        return self.data_frame[attr]

    def split_set(self):
        pass

This class is generic; I need to implement the split_set function to split the set. I'm just starting with pandas and Python, so sorry if this is very easy to solve, but I can't think of a good way to do it. Thanks in advance.

4 Answers

Add a column R to your data. Assign to it either a hash of the row or a random number, scaled so that its value lies between 0 and 1.

Then 0 <= R < .5 implies a training row, .5 <= R < .75 implies test, and .75 <= R < 1 implies validation.
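
A minimal sketch of the random-number variant, assuming NumPy is available. The function name `split_by_random_column`, the `seed` parameter, and the toy frame are hypothetical, and R is kept as a separate Series rather than a real column so the returned frames keep only the original attributes. Because each row is assigned independently, the 50/25/25 proportions are only approximate.

```python
import numpy as np
import pandas as pd


def split_by_random_column(df, seed=0):
    """Split df into roughly 50% train, 25% test, 25% validation."""
    rng = np.random.default_rng(seed)
    # R holds one uniform random value in [0, 1) per row, aligned on df's index.
    r = pd.Series(rng.random(len(df)), index=df.index)
    train = df[r < 0.5]
    test = df[(r >= 0.5) & (r < 0.75)]
    validation = df[r >= 0.75]
    return train, test, validation


# Toy frame standing in for the CSV; the number of columns does not matter.
df = pd.DataFrame({"a": range(100), "b": range(100, 200)})
train, test, validation = split_by_random_column(df)
```

Every row lands in exactly one group, but the group sizes vary from run to run unless the seed is fixed.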

J_H

I think you can randomly reorder the DataFrame and take the first 50% of the rows as train, the rows from 50% to 75% as test, and the rows from 75% to 100% as validation.

df = df.sample(frac=1)  # randomly reorder the whole dataframe
n_rows = len(df)

train_idx = n_rows // 2
test_idx = train_idx + n_rows // 4

train = df.iloc[:train_idx, :]
test = df.iloc[train_idx: test_idx, :]
val = df.iloc[test_idx:, :]

Hope it helps!
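
As a sketch of how the slicing above might fill in the question's `split_set`, here it is as a standalone function that takes any DataFrame. The `random_state` parameter and the toy frame are assumptions added for reproducibility and illustration; they are not part of the original answer.

```python
import pandas as pd


def split_set(data_frame, random_state=None):
    """Shuffle the rows, then slice them into 50% / 25% / 25%."""
    # sample(frac=1) returns all rows in random order, keeping every column.
    shuffled = data_frame.sample(frac=1, random_state=random_state)
    n_rows = len(shuffled)
    train_end = n_rows // 2
    test_end = train_end + n_rows // 4

    train = shuffled.iloc[:train_end]
    test = shuffled.iloc[train_end:test_end]
    validation = shuffled.iloc[test_end:]  # takes whatever rows remain
    return train, test, validation


# Toy frame with 3 attributes and 100 rows; 40+ attributes would work the same way.
df = pd.DataFrame({"x": range(100), "y": range(100), "z": range(100)})
train, test, validation = split_set(df, random_state=1)
print(len(train), len(test), len(validation))  # 50 25 25
```

When the row count is not divisible by 4, the validation slice absorbs the remainder, so it can be slightly larger than the test slice.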

One option is sklearn.model_selection.train_test_split from the scikit-learn library.

import numpy as np
from sklearn.model_selection import train_test_split

X= np.arange(10).reshape((5, 2))
X_train, X_test = train_test_split(X, test_size=0.33, random_state=42)

You can then see that the data are split into training and testing sets. To get more sets, repeat the step on one of the resulting parts until you have what you need.
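
For the 50/25/25 split in the question, repeating the step could look like the sketch below. It assumes scikit-learn is installed; the toy frame and the `rest` variable name are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the CSV data.
df = pd.DataFrame({"a": range(100), "b": range(100)})

# First split: half of the rows go to training, the other half is set aside.
train, rest = train_test_split(df, test_size=0.5, random_state=42)

# Second split: the set-aside half is divided evenly into test and validation.
test, validation = train_test_split(rest, test_size=0.5, random_state=42)

print(len(train), len(test), len(validation))  # 50 25 25
```

Passing a DataFrame returns DataFrames, so all attributes are carried along without naming them.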

tomcy

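You can use the sklearn library:

from sklearn.model_selection import train_test_split

# X holds the input attributes and Y the expected results
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, train_size=0.5)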
youssef mhiri
  • What is the difference between X_train and Y_train in this case? I don't need to separate them. – João Victor Canabarro Apr 16 '18 at 23:17
  • So if I want to split the set into 3 parts, would I do `train, test = train_test_split(data_frame, test_size=0.5, train_size=0.5)` and then `test, validation = train_test_split(test, test_size=0.5, train_size=0.5)` to separate test and validation? – João Victor Canabarro Apr 16 '18 at 23:32
  • Yes, you don't need X_train and Y_train; it was just an example where X was the input and Y was the result. – youssef mhiri Apr 17 '18 at 21:32