2

I need to split a dataframe into 10 parts, then use one part as the test set and the remaining 9 (merged) as the training set. I have come up with the following code, which splits the dataset, and I am trying to merge the remaining parts after picking one of the 10. The first iteration goes fine, but I get an error in the second iteration.

df = pd.DataFrame(np.random.randn(10, 4), index=list(xrange(10)))

for x in range(3):
    dfList = np.array_split(df, 3)
    testdf = dfList[x]
    dfList.remove(dfList[x])
    print testdf
    traindf = pd.concat(dfList)
    print traindf
    print "================================================"

[Image: screenshot of the error, not available]

swati saoji
  • Why not scikit-learn Cross Validation? http://scikit-learn.org/stable/modules/cross_validation.html#random-permutations-cross-validation-a-k-a-shuffle-split – Liam Foley Apr 02 '15 at 03:12
  • I am doing this as an assignment as a part of course and trying to implement validation. – swati saoji Apr 02 '15 at 03:14

5 Answers

2

I don't think you have to split the dataframe into 10 parts, just into 2. I use this code to split a dataframe into a training set and a validation set:

test_index = np.random.choice(df.index, int(len(df.index)/10), replace=False)

test_df = df.loc[test_index]

train_df = df.loc[~df.index.isin(test_index)]

Spas
  • this is a much better solution – Haleemur Ali Apr 02 '15 at 14:38
  • @Haleemur Ali this is good if I need to divide it just once into 1:9, that is, randomly selecting 1/10th as the test set. But I am trying to implement k-fold validation where, as far as I understand, you break the data into K blocks. Then, for k = 1 to K, you make the kth block the test block and the rest of the data becomes the training data. Train, test, record, and then update k. – swati saoji Apr 02 '15 at 18:51
  • You can split the dataframe index into chunks and loop through it. See http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python – Spas Apr 03 '15 at 13:19
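The index-chunking idea from the comment above could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `k_fold_frames`, its `seed` parameter, and the example frame are hypothetical and not part of the original answer, and the index labels are assumed to be unique.

```python
import numpy as np
import pandas as pd


def k_fold_frames(df, k, seed=0):
    """Yield (train_df, test_df) pairs, one per fold.

    Hypothetical helper: shuffles the index labels once, cuts them into
    k roughly equal chunks, and uses each chunk in turn as the test set.
    """
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(df.index.values)
    for test_index in np.array_split(shuffled, k):
        test_df = df.loc[test_index]
        # Everything whose label is not in the test chunk is training data
        train_df = df.loc[~df.index.isin(test_index)]
        yield train_df, test_df


df = pd.DataFrame(np.random.randn(10, 4), index=range(10))
folds = list(k_fold_frames(df, k=5))
for train_df, test_df in folds:
    print(len(train_df), len(test_df))
```

Shuffling once and then chunking means every row lands in exactly one test set, which repeated calls to `np.random.choice` would not guarantee.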
0

Okay, I got it working this way:

df = pd.DataFrame(np.random.randn(10, 4), index=list(xrange(10)))

dfList = np.array_split(df, 3)
for x in range(3):
    trainList = []
    for y in range(3):
        if y == x :
            testdf = dfList[y]
        else:
            trainList.append(dfList[y])
    traindf = pd.concat(trainList)
    print testdf
    print traindf
    print "================================================"

But a better approach is welcome.
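One likely reason the original `dfList.remove(dfList[x])` failed is that `list.remove` searches by equality, and comparing two DataFrames with `==` does not produce a single True/False. On the first iteration the wanted item is the first element, so it is matched by identity before any comparison happens. A minimal sketch of a shorter loop that removes by position instead (written for Python 3; the names `df_list`, `n_parts`, and `results` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4), index=range(10))

n_parts = 3
# Split row positions and slice with iloc, rather than relying on how
# np.array_split treats a DataFrame in a given pandas/NumPy version.
position_chunks = np.array_split(np.arange(len(df)), n_parts)
df_list = [df.iloc[chunk] for chunk in position_chunks]

results = []
for x in range(n_parts):
    parts = list(df_list)    # shallow copy, so df_list is never mutated
    test_df = parts.pop(x)   # removes by position; no DataFrame comparison
    train_df = pd.concat(parts)
    results.append((train_df, test_df))
    print(len(train_df), len(test_df))
```

Because the split happens once outside the loop and `pop` works on a copy, every iteration sees the same parts.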


swati saoji
0

You can use the permutation function from numpy.random

import numpy as np
import pandas as pd
import math as mt
l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
df = pd.DataFrame({'a': l, 'b': l})

Shuffle the dataframe index:

shuffled_idx = np.random.permutation(df.index)    

Divide shuffled_idx into N equal(ish) parts. For this example, let N = 4:

N = 4
n = len(shuffled_idx) / N
parts = []
for j in range(N):
    parts.append(shuffled_idx[mt.ceil(j*n): mt.ceil(j*n+n)])

# to show each shuffled part of the data frame
# (the parts hold index labels, so select with .loc rather than .iloc)
for k in parts:
    print(df.loc[k])
Haleemur Ali
0

I wrote a script (you can find or fork it on GitHub) for splitting a Pandas dataframe randomly. See also the Pandas documentation on merge, join, and concatenate functionality.

Same code for your reference:

    import pandas as pd
    import numpy as np

    from xlwings import Sheet, Range, Workbook

    # path to file
    df = pd.read_excel(r"//PATH TO FILE//")

    df.columns = [c.replace(' ', "_") for c in df.columns]
    x = df.columns[0].encode("utf-8")

    # number of parts the data frame or the list needs to be split into
    n = 7
    seq = list(df[x])
    np.random.shuffle(seq)
    lists1 = [seq[i:i+n] for i in range(0, len(seq), n)]
    listsdf = pd.DataFrame(lists1).reset_index()

    dataframesDict = dict()

    # calling xlwings workbook function
    Workbook()

    for i in range(0, n):
        # add a sheet while the workbook has fewer than n sheets
        if Sheet.count() < n:
            Sheet.add()

        # rows whose value appears in column i of listsdf form part i
        # (Column_Name stands for the column that seq was built from)
        dataframesDict[i] = df.loc[df.Column_Name.isin(list(listsdf[listsdf.columns[i+1]]))]

        Range(i, "A1").value = dataframesDict[i]
0

Looks like you are trying to do a k-fold type thing, rather than a one-off. This code should help. You may also find the SKLearn k-fold functionality works in your case, that's also worth checking out.

# Split dataframe by rows into n roughly equal portions and return list of 
# them.
def splitDf(df, n) :
    splitPoints = list(map( lambda x: int(x*len(df)/n), (list(range(1,n)))))     
    splits = list(np.split(df.sample(frac=1), splitPoints))
    return splits

# Take splits from splitDf, and return into test set (splits[index]) and training set (the rest)
def makeTrainAndTest(splits, index) :
   # index is zero based, so range 0-9 for 10 fold split
   test = splits[index]

   leftLst = splits[:index]
   rightLst = splits[index+1:]

   train = pd.concat(leftLst+rightLst)

   return train, test

You can then use these functions to make the folds

df = <my_total_data>
n = 10
splits = splitDf(df, n)
trainTest = []
for i in range(0, n):
    trainTest.append(makeTrainAndTest(splits, i))

# Get test set 2 (each entry is a (train, test) tuple)
test2 = trainTest[2][1]

# Get training set zero
train0 = trainTest[0][0]
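
For comparison, the scikit-learn k-fold functionality mentioned above could be used roughly like this. It is a sketch that assumes scikit-learn is installed; the 20-row example frame and the name `folds` are made up for illustration. `KFold.split` yields row positions, so the frames are selected with `iloc`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Example data only; substitute your own dataframe
df = pd.DataFrame(np.random.randn(20, 4))

kf = KFold(n_splits=10, shuffle=True, random_state=0)

folds = []
for train_pos, test_pos in kf.split(df):
    # train_pos and test_pos are arrays of integer row positions
    folds.append((df.iloc[train_pos], df.iloc[test_pos]))

train0, test0 = folds[0]
print(train0.shape, test0.shape)
```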
Tom Walker