How to sample data by keeping at least two non zero columns

Question

I have a pandas DataFrame of roughly 50K x 9.5K dimensions. My dataset is binary, that is, it contains only 1s and 0s, and it has a lot of zeros.

Think of it as user-item purchase data, where a cell is 1 if the user purchased the item and 0 otherwise. Users are rows and items are columns.

353 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
354 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
355 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
356 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
357 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0

I want to split it into training, validation and test sets. However, it is not going to be just a normal split by rows.

What I want is that, for each validation and test set, I keep between 2 and 4 columns from the original data which are non-zero.

So basically, if my original data had 9.5K columns for each user, I first keep only, let's say, 1500 or so columns. Then I split this sampled data into train and test by keeping something like 1495-1498 columns in train and 2-5 columns in test/validation. The columns which are in test are ONLY those which are non-zero. Training can have both.

I also want to keep the item name/index corresponding to those which are retained in test/validation.

I don't want to run a loop to check each cell value and put it in the next table.

Any idea?

EDIT 1:

So this is what I am trying to achieve.

I've read the part about splitting at least three times but still don't understand it. First of all, what is a non-zero column? A column that has *at least* one non-zero element? A column that has non-zero elements for the corresponding rows? All non-zero? Are you sampling from columns? — ayhan, Jul 14 '16 at 18:29
So just take the first row of the table as an example. Each row represents one user's interactions, and each column is a different item. If a cell is 1, the user purchased that item; otherwise he did not. What I want is that, if there are 1500 columns, I put 2-5 columns which have 1 as the value into the test/validation data and the remaining ones into training. And I want to do this for each user, for the whole table — Baktaawar, Jul 14 '16 at 18:34
So for each user you have different training and test sets? For user 1, you will train on some columns, test on non-zero columns but for user 2 these columns will change? Are you sure that's a good idea? — ayhan, Jul 14 '16 at 18:37
Yes, that's right. I am building a recommender system, so I need to have a few non-zero columns for each user — Baktaawar, Jul 14 '16 at 18:50

score 1 · Answer 1 · answered Jul 14 '16 at 18:29

So, by non-zero, I am guessing you mean those columns which only have ones in them. That is fairly easy to do. Best approach probably is to use sum, like so:

import numpy

sums = df.sum(axis=0) # sum down each column: a Series with column names as the index and column sums as values
non_zero_cols = sums[sums == len(df)].index # names of the columns that are 1 in every row

# Now split the data into training and testing columns
test_cols = numpy.random.choice(non_zero_cols, 2, replace=False) # or 5; the columns are picked at random
test_data = df[test_cols]
train_data = df.drop(columns=test_cols) # drop by column name, not by row label

Is that what you are looking for?

answered Jul 14 '16 at 18:29

Kartik


Quick question. When you find the non-zero cols, that would be a list of all columns which are non-zero across users, right? So when you pick any 2 or 5 at random from this list, there is a high chance of picking columns which are not non-zero for a particular user but are for others. In that case, some users would have columns in the test data which are zero rather than non-zero, right? – Baktaawar Jul 14 '16 at 18:54
Wait. you managed to confuse me. Here is what happens, you are taking sums over columns, so for each column, you are taking the sum of all rows. This sum should, in case of columns without any zeros, equal the total number of rows. The `non_zero_cols` is a list of these column names. Picking random names from this list will leave you with columns that are still 1 for all rows (items). Or do you want rows? You mentioned that users are rows and products are columns. Do you want those users who have bought all products, or those products bought by all users? – Kartik Jul 14 '16 at 19:04
Updated the answer with a screenshot – Baktaawar Jul 14 '16 at 19:50
Oh, so you want to do this for each user... I would suggest taking a stab at @piRSquared's answer. But to do this for each row, you will need some kind of a loop, otherwise how will you fit your model and evaluate it for each row? – Kartik Jul 14 '16 at 19:59

score 0 · Answer 2 · edited May 23 '17 at 11:45

IIUC:

threshold = 6
new_df = df.loc[df.sum(1) >= threshold]

df.sum(1) sums over each row. Since these are 1s and 0s, this is equivalent to counting.

df.sum(1) >= threshold creates series of Trues and Falses, also referred to as a boolean mask.

df.loc happens to accept boolean masks as a way to slice.

df.loc[df.sum(1) >= threshold] passes the boolean mask to df.loc and returns only those rows that had a corresponding True in the boolean mask.

Since the boolean mask only had Trues when there existed a count of 1s greater than or equal to threshold, this equates to returning a slice of the dataframe in which each row has at least a threshold number of non-zeroes.
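To make the steps above concrete, here is a minimal sketch on a toy frame. The data, the user labels and the item names are made up for illustration, and the threshold is lowered to 3 so that the tiny example keeps some rows.

```python
import pandas as pd

# Hypothetical toy user-item matrix: rows are users, columns are items.
df = pd.DataFrame(
    [[1, 0, 1, 1],
     [0, 0, 1, 0],
     [1, 1, 1, 1],
     [0, 0, 0, 0]],
    index=["u1", "u2", "u3", "u4"],
    columns=["item_a", "item_b", "item_c", "item_d"],
)

threshold = 3
row_counts = df.sum(axis=1)      # number of 1s per user
mask = row_counts >= threshold   # boolean Series aligned on the row index
new_df = df.loc[mask]            # keep only the users that meet the threshold

print(new_df)
```

Only the users with at least `threshold` purchases survive the filter, which guarantees that every remaining row has enough non-zero cells to hold some of them out later.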

And then refer to this answer on how to split into test, train, validation sets.

Or this answer
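If the goal is instead a per-user hold-out, as the comments under the question suggest, one possible sketch is below. It is not taken from either linked answer: the function name `holdout_per_user`, the toy data and the choice to zero out held-out cells in the training frame are all assumptions made for illustration. It also assumes every row has at least `n_test` ones, which the threshold filter above can ensure.

```python
import numpy as np
import pandas as pd

def holdout_per_user(df, n_test=2, seed=0):
    """Hypothetical helper: move n_test randomly chosen 1-cells per row into test."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy()
    # Random score for every cell; zero cells get -1 so they are never preferred.
    scores = np.where(values == 1, rng.random(values.shape), -1.0)
    # Column positions of the n_test highest scores in each row.
    top = np.argpartition(-scores, n_test - 1, axis=1)[:, :n_test]
    test_mask = np.zeros(values.shape, dtype=bool)
    np.put_along_axis(test_mask, top, True, axis=1)
    test_mask &= values == 1         # guard for rows with fewer than n_test ones
    train = df.mask(test_mask, 0)    # held-out cells are zeroed in train
    test = df.where(test_mask, 0)    # only held-out cells are kept in test
    return train, test

# Hypothetical toy data: every user has at least three purchases.
df = pd.DataFrame(
    [[1, 0, 1, 1, 0],
     [0, 1, 1, 0, 1],
     [1, 1, 1, 1, 0]],
    index=["u1", "u2", "u3"],
    columns=list("abcde"),
)

train, test = holdout_per_user(df, n_test=2)

# Item names held out for each user, without looping over cells.
rows, cols = np.nonzero(test.to_numpy())
held_out = pd.Series(test.columns[cols], index=test.index[rows])
print(held_out)
```

Both frames keep the full set of columns, so the held-out items differ per user while the shapes stay identical, and `held_out` records which item names went to the test side for each user.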

edited May 23 '17 at 11:45

Community


answered Jul 14 '16 at 18:33

piRSquared


Could you help me understand the second line? I am not sure I get it. – Baktaawar Jul 14 '16 at 20:14
Got it. So this would give me a data frame where each user/row has at least a threshold number of non-zeros. Fine. But it doesn't help me do the split. The split you have mentioned is by rows. What I want is that, for each user, a minimum of 2 non-zero cell values are put in test and the remaining ones in training. The problem is that we won't have the same columns for each user, since some users might not have the same non-zero columns in test. – Baktaawar Jul 14 '16 at 21:02
