
I have around 3000 objects where each object has a count associated with it. I want to randomly divide these objects into training and testing sets with a 70% training and 30% testing split. However, I want the split to be based on the counts associated with the objects, not on the number of objects.

As an example, assume my dataset contains 5 objects:

Obj 1 => 200
Obj 2 => 30
Obj 3 => 40
Obj 4 => 20
Obj 5 => 110

If I split them with a nearly 70%-30% ratio, my training set should be

Obj 2 => 30
Obj 3 => 40
Obj 4 => 20
Obj 5 => 110

and my testing set would be

Obj 1 => 200

If I split them again, I should get a different training and testing set nearing the 70-30 split ratio. I understand the above split does not give a pure 70-30 split, but as long as it is close, it's acceptable.

Are there any predefined methods/packages to do this in Python?

van_d39
  • Possible duplicate of [Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation?](http://stackoverflow.com/questions/3674409/numpy-how-to-split-partition-a-dataset-array-into-training-and-test-datasets) – Zafi Jul 27 '16 at 13:49
  • Just for the record, this is probably a really bad idea. You generally want to keep your training set the same so that you don't train to your test data. – Oscar Smith Jul 27 '16 at 14:16

2 Answers


Assuming I understand your question correctly, my suggestion would be this:

from random import shuffle

total = sum(obj.count for obj in obj_list)  # total "count" of all the objects, O(n)
shuffle(obj_list)
running_sum = 0
i = 0
while running_sum < total * .3:
    running_sum += obj_list[i].count
    i += 1
training_data = obj_list[i:]  # roughly 70% of the total count
testing_data = obj_list[:i]   # roughly 30% of the total count

This entire operation is O(n); you're not going to get better time complexity than that. There are certainly ways to condense the loop into one-liners, but I don't know of any builtin that accomplishes what you're asking with a single function, especially not when you want it "random" in the sense that each split produces a different training/testing set (as I understand the question).
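For reference, the snippet above could be packaged as a reusable function. The `(name, count)` tuples and the `split_by_count` name below are illustrative, not from the question:

```python
from random import shuffle

def split_by_count(objs, test_fraction=0.3):
    """Randomly split (name, count) pairs so the testing set holds
    roughly test_fraction of the TOTAL COUNT, not of the object count."""
    objs = list(objs)  # copy so the caller's list is left untouched
    total = sum(count for _, count in objs)
    shuffle(objs)
    running_sum = 0
    i = 0
    while running_sum < total * test_fraction:
        running_sum += objs[i][1]
        i += 1
    return objs[i:], objs[:i]  # (training, testing)

# Example with the question's data
data = [("Obj 1", 200), ("Obj 2", 30), ("Obj 3", 40),
        ("Obj 4", 20), ("Obj 5", 110)]
training, testing = split_by_count(data)
```

Each call reshuffles, so repeated calls give different near-70/30 splits, matching the question's requirement.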

James
  • Thank you for your response. You've understood my problem correctly. The approach is pretty optimized. I agree that to get the sum, I will have to loop it through all the objects once. Thus, the O(n). However, the line `if running_sum > sum * .7` will make the training set always more than 70%, am I correct in making this statement. – van_d39 Jul 27 '16 at 14:09
  • More by a single item, you're right. I guess I assumed that on a set of 3000 items a single item wouldn't make a large difference. If that is an issue, then I would add the line i -= randint(0,1) so that it is randomly either slightly lower than .7 or slightly higher – James Jul 27 '16 at 14:11
  • 1
    I also edited to make the loop stop after .3, realizing that you only need to find the first .3 to know .7, so going to .3 is faster - that'll save some time, not sure why I didn't think of that originally – James Jul 27 '16 at 14:14
  • 1
    In this case, there is a likelihood that first entry may be much greater than 30% (say 50%). So, you will end up with unwanted split. Running loop till 0.7 provides a greater safety net. – Learner Jul 27 '16 at 14:18
  • @Learner It depends on exactly how uniform the testing data is. You could certainly write the function to either try again or skip data if it's outside some range of acceptability, but the disadvantage of that is that it makes it not uniformly random. You're right though, if getting exactly close to .7 matters more than the speed of the function, it may be better to loop until .7 – James Jul 27 '16 at 14:20

I do not know if there is a specific function in Python for this, but assuming there isn't, here is an approach.

Shuffle objects:

from random import shuffle
values = [200, 40, 30, 110, 20]
shuffle(values)  # shuffle works in place and returns None

Calculate each value's fraction of the total:

prob = [float(v)/sum(values) for v in values]

Apply a loop:

running = 0
index = len(prob)  # fallback if the loop never passes 0.7
for i in range(len(prob)):
    if running > 0.7:
        index = i - 1
        break
    running = running + prob[i]

Now, the objects before index form the training set and the objects from index onward form the testing set.
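Putting the three steps together as one runnable sketch (the final slicing step and the `index` fallback are additions here; as the comments on the other answer note, a large item at the end of the shuffle can keep the running total from ever passing 0.7, in which case everything lands in the training set):

```python
from random import shuffle

values = [200, 40, 30, 110, 20]
shuffle(values)  # shuffle works in place and returns None

# Fraction of the total count contributed by each value
prob = [float(v) / sum(values) for v in values]

# Accumulate fractions until the running total passes 0.7
running = 0
index = len(values)  # fallback: everything goes to training
for i in range(len(prob)):
    if running > 0.7:
        index = i - 1
        break
    running = running + prob[i]

training_values = values[:index]  # roughly 70% of the total count
testing_values = values[index:]   # the remainder
```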

Learner