
While creating train, test and cross-validation samples in Python, I see the default method as:

1. Reading the dataset, after skipping headers
2. Creating the train, test and cross-validation samples

    import csv
    with open('C:/Users/Train/Trainl.csv', 'r') as f1:
        next(f1)  # skip the header row
        reader = csv.reader(f1, delimiter=',')
        input_set = []
        for row in reader:
            input_set.append(row)

    import numpy as np
    from sklearn import cross_validation

    # 60% train; then split the remaining 40% evenly into cv and test (20% each)
    train, intermediate_set = cross_validation.train_test_split(input_set, train_size=0.6, test_size=0.4)
    cv, test = cross_validation.train_test_split(intermediate_set, train_size=0.5, test_size=0.5)

My problem, though, is that I have a field, say "A", in the csv file that I read into the numpy array, and all sampling should respect this field. That is, all entries with the same value of "A" should go into one sample.

Line # | A | B | C | D
1      | 1 |
2      | 1 |
3      | 1 |
4      | 1 |
5      | 2 |
6      | 2 |
7      | 2 |

Required: lines 1, 2, 3, 4 should go together in one sample, and lines 5, 6, 7 should go together in one sample. The value of column A is a unique id corresponding to one single entity (it could be seen as cross-section data points on one SINGLE user, so it MUST go into one unique sample of train, test, or cv), and there are many such entities, so a grouping by entity id is required.

Columns B, C and D may have any values; grouping preservation is not required on them. (Bonus: can I group the sampling on multiple fields?)
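A minimal sketch of such a grouped split, assuming a newer scikit-learn version (which provides GroupShuffleSplit in sklearn.model_selection) and assuming column "A" is the first column of the csv read above:

    import numpy as np
    from sklearn.model_selection import GroupShuffleSplit

    data = np.array(input_set)   # rows read from the csv above
    groups = data[:, 0]          # column "A": the entity id

    # First split: 60% of the *groups* go to train; no group is ever split.
    gss = GroupShuffleSplit(n_splits=1, train_size=0.6, test_size=0.4, random_state=42)
    train_idx, rest_idx = next(gss.split(data, groups=groups))
    train, intermediate_set = data[train_idx], data[rest_idx]

    # Second split: divide the remaining groups evenly into cv and test.
    gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, test_size=0.5, random_state=42)
    cv_idx, test_idx = next(gss2.split(intermediate_set, groups=groups[rest_idx]))
    cv, test = intermediate_set[cv_idx], intermediate_set[test_idx]

    # Bonus: to group on multiple fields, build a composite key, e.g.
    # groups = np.array([row[0] + "|" + row[1] for row in input_set])

GroupShuffleSplit shuffles and splits the distinct group labels, so every row with the same "A" lands on the same side of each split; note that the proportions are taken over groups, not rows.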

What I tried:

A. Finding all unique values of "A" and denoting these as my sample, I distribute the sample among train, intermediate (cv) and test, then put all remaining rows for each value of "A" into the file its value landed in. That is, if train got the entry "3", test got "2" and cv got "1", then all rows with A = 3 go in train, all with A = 2 go in test, and all with A = 1 go in cv.

  1. Of course, this approach is not scalable.
  2. I also suspect it may introduce bias into the datasets, since the number of rows with A = 1, A = 2, and so on is not equal, meaning this approach will not work!

B. I also tried numpy.random.shuffle and numpy.random.permutation, as per the thread "Numpy: How to split/partition a dataset (array) into training and test datasets for, e.g., cross validation?", but they did not meet my requirement.

C. A third option, of course, is writing a custom function that does this grouping and then balances the train, test and cv datasets based on the number of data points in each group. But I am just wondering whether there is already an efficient way to implement this.

Note that my dataset is huge, so ideally I would like a deterministic way to partition my datasets, without multiple eyeball scans to be sure that the partition is correct.
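One deterministic possibility is sketched below (the helper name and the md5-based hashing are illustrative choices, not an established recipe): hash each entity id into [0, 1) and bucket all of its rows by that hash. This is a single O(n) pass, reproducible across runs and machines, though the 60/20/20 proportions hold only approximately, in expectation over entities:

    import hashlib

    def bucket_for(entity_id, train_frac=0.6, cv_frac=0.2):
        """Deterministically map an entity id to 'train', 'cv' or 'test'."""
        # Stable hash -> pseudo-uniform float in [0, 1); the same id always
        # lands in the same bucket, so no shuffling or eye-balling is needed.
        h = int(hashlib.md5(str(entity_id).encode('utf-8')).hexdigest(), 16)
        u = (h % 10**8) / float(10**8)
        if u < train_frac:
            return 'train'
        if u < train_frac + cv_frac:
            return 'cv'
        return 'test'

    splits = {'train': [], 'cv': [], 'test': []}
    for row in input_set:                          # single O(n) pass
        splits[bucket_for(row[0])].append(row)     # row[0] assumed to be "A"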

EDIT Part 2:

Since I did not find anything that fit my sampling criteria, I actually wrote a module to sample with grouping constraints. This is the GitHub code for it. The code was not written with very large data in mind, so it's not very efficient. Should you FORK this code, please point out how I can improve the run-time. https://github.com/ekta1007/Sampling-techniques/blob/master/sample_expedia.py

ekta
  • I have a [function that does roughly this](https://github.com/larsmans/seqlearn/blob/master/seqlearn/evaluation.py#L90) in my sequence learning extension to scikit-learn. I'm not sure if it's appropriate for your problem, though. – Fred Foo Sep 19 '13 at 07:50

2 Answers


By forcing such constraints you will introduce bias into your procedure either way. So the approach based on partitioning the "users" and then collecting their respective "measurements" does not seem bad. It will scale just fine; this is an O(n) method, and the only reason for it not scaling up is a bad implementation, not a bad method.

The reason there is no such functionality in existing methods (like the scikit-learn library) is that it looks highly artificial and runs counter to the idea behind machine learning models. If these rows are somehow one entity, then they should not be treated as separate data points. If you need this separate representation, then requiring a division such that a particular entity cannot be partially in the test set and partially in the training set will surely bias the whole model.

To sum up: you should really deeply analyze whether your approach is reasonable from the machine learning point of view. If you are sure about it, I think the only possibility is to write the segmentation yourself, as, even though I have used many ML libraries in the past, I've never seen such functionality.

In fact, I am not sure whether the problem of segmenting a set containing N numbers (the sizes of the entities) into K (= 3) subsets of given sum proportions, with a uniform distribution when treated as a random process, is not an NP problem in itself. If you cannot guarantee a uniform distribution, then your datasets cannot be used as a statistically correct way of training/testing/validating your model. Even if it has a reasonable polynomial solution, it can still scale up badly (much worse than linear methods). This doubt applies if your constraints are "strict"; if they are "weak", you can always take a "generate and reject" approach, which should have amortized linear complexity.
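A minimal sketch of that "generate and reject" idea (illustrative only; the helper name, tolerance and retry cap are not from this answer): randomly assign whole entities to the three splits, then reject and retry until the row proportions are within a tolerance of the 60/20/20 targets:

    import random

    def grouped_split(sizes, fracs=(0.6, 0.2, 0.2), tol=0.02, max_tries=1000):
        """sizes: {entity_id: row_count}. Returns {entity_id: 0|1|2}
        (0 = train, 1 = cv, 2 = test) with row proportions within tol."""
        total = float(sum(sizes.values()))
        for _ in range(max_tries):
            # Generate: drop each whole entity into one split at random.
            assign = {eid: random.choices((0, 1, 2), weights=fracs)[0]
                      for eid in sizes}
            got = [0.0, 0.0, 0.0]
            for eid, split in assign.items():
                got[split] += sizes[eid]
            # Reject unless every split's row share is close to its target.
            if all(abs(got[i] / total - fracs[i]) <= tol for i in range(3)):
                return assign
        raise ValueError("no acceptable split found - try loosening tol")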

lejlot
  • Thanks @lejlot - to be more specific about my application: for my use case, I have users who are given a list of recommendations for items, so there are multiple entries per user, i.e. the exact number of recommendations per user differs. All I know here is that some entries belong to ONE user and are different from those of an OTHER user. My goal is to improve on the existing recommendation listing, say column "B" above. Any afterthoughts, other than introducing this artificial bias and balancing the total number of data points in EACH of the test, train and cv datasets? – ekta Sep 18 '13 at 07:22
  • In recommender systems you just split the data as usual, without any constraints on entities; there is no sense in applying anything else, as your system should be able not only to predict recommendations for a "blank" (new) user, but also to suggest new items to existing ones, so it actually **should** be split in such a way that one user is in all of the train/test/cv splits. If knowledge about one user is **useless** for predicting another one, then you should not build **one** classifier, but as many as you have users. Either way, it contradicts the proposed approach. – lejlot Sep 18 '13 at 07:46

I was also facing a similar kind of issue. Though my coding is not too good, I came up with the solution given below:

  1. Create a new data frame that contains only the unique ids from df, with duplicates removed:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    new = df[["Unique_Id"]].copy()
    New_DF = new.drop_duplicates()

  2. Create the training and test sets on the basis of New_DF:

    train, test = train_test_split(New_DF, test_size=0.2)

  3. Then merge those training and test sets back with the original df:

    df_Test = pd.merge(df, test, how='inner', on='Unique_Id')
    df_Train = pd.merge(df, train, how='inner', on='Unique_Id')

Similarly, we can create a sample for the validation part too.
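For instance, a sketch of that validation step under the same assumptions (df and the Unique_Id column from above), splitting the training ids once more so the validation groups stay disjoint from both train and test:

    # 25% of the training ids (= 20% of all ids) become the validation set.
    train_ids, val_ids = train_test_split(train, test_size=0.25)
    df_Val = pd.merge(df, val_ids, how='inner', on='Unique_Id')
    df_Train = pd.merge(df, train_ids, how='inner', on='Unique_Id')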

Cheers.

Saurabh Bade