
I have a data set including user ID, item ID, and rating as below:

user ID     item ID    rating
 1233        1011       4
 1220        0999       3
 2011        0702       1
 ...

When I split them into train and test sets:

from sklearn import cross_validation

train, test = cross_validation.train_test_split(df, test_size = 0.2)

Will the users in the test set have already appeared in the training set, and likewise the items? If not, how can I ensure that? I cannot find the answer in the documentation. Could you please tell me?
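For what it's worth, this property can be checked directly after any split; `train_test_split` itself does nothing to guarantee it. A minimal sketch with a toy frame (column names are hypothetical, and a deterministic split is used instead of a random one):

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1233, 1220, 2011, 1233, 1220],
                   'item_id': [1011, 999, 702, 999, 702],
                   'rating':  [4, 3, 1, 2, 5]})

# toy deterministic split: first row as the "test set"
test = df.iloc[:1]
train = df.iloc[1:]

# True only if every test user / item also occurs in the training set
users_covered = test['user_id'].isin(train['user_id']).all()  # True: user 1233 rates again
items_covered = test['item_id'].isin(train['item_id']).all()  # False: item 1011 appears once
```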

user5779223
    I don't understand the question. What exactly do you want to do? – MB-F Feb 15 '16 at 14:19
  • @kazemakase The model is to predict the `rating` from `user` to `item` in the test set. To do so, we must measure the latent factors of `user` and `item` in the training set. So how can I ensure the users in the test set are also in the training set? Of course, the same thing should happen with the items. Is it better? – user5779223 Feb 15 '16 at 16:33
  • I don't really understand what you're asking either. Do you want to stratify on users, on items, or on unique combinations of user and item? For example, would you allow your training and test partitions to both contain rankings of different items by user X, or for both to contain rankings of item Y by different users? Would it be OK for them both to contain examples of user X and item Y as long as they don't both contain a rating of item Y by user X? – ali_m Feb 16 '16 at 22:39
  • @ali_m That's what I mean: `allow your training and test partitions to both contain rankings of different items by user X, or for both to contain rankings of item Y by different users` and `OK for them both to contain examples of user X and item Y as long as they don't both contain a rating of item Y by user X` – user5779223 Feb 17 '16 at 09:33

2 Answers


If you want to ensure that your training and test partitions don't contain the same pairings of user and item then you could replace each unique (user, item) combination with an integer label, then pass these labels to LabelKFold. To assign integer labels to each unique pairing you could use this trick:

import numpy as np
import pandas as pd
from sklearn.cross_validation import LabelKFold

df = pd.DataFrame({'users':[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':[0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings':[2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

users_items = df[['users', 'items']].values
d = np.dtype((np.void, users_items.dtype.itemsize * users_items.shape[1]))
_, uidx = np.unique(np.ascontiguousarray(users_items).view(d), return_inverse=True)

for train, test in LabelKFold(uidx):

    # train your classifier using df.loc[train, ['users', 'items']] and
    # df.loc[train, 'ratings']...

    # cross-validate on df.loc[test, ['users', 'items']] and
    # df.loc[test, 'ratings']...
    pass
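As an aside, on NumPy >= 1.13 the same inverse labels can be computed without the void-view trick, using `np.unique` with `axis=0` (a sketch, not part of the original answer):

```python
import numpy as np

users_items = np.array([[0, 0], [0, 1], [1, 0], [0, 0], [1, 0]])

# one integer label per distinct (user, item) row
_, uidx = np.unique(users_items, axis=0, return_inverse=True)
# rows 0 and 3 share a label, as do rows 2 and 4
```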

I'm still having a hard time understanding your question. If you want to guarantee that your training and test sets do contain examples of the same user then you could use StratifiedKFold:

from sklearn.cross_validation import StratifiedKFold

for train, test in StratifiedKFold(df['users']):
    # ...
    pass
ali_m
  • Sorry, I am not worried about the training and test sets containing the same pair of user and item. I am worried that a user appearing in the test set has not been measured in the training set. – user5779223 Feb 17 '16 at 13:13
  • See my edit. I'm finding it difficult to infer what you want since your question is very ambiguously worded. – ali_m Feb 17 '16 at 13:38
import numpy as np


def train_test_split(ratings, train_rate=0.8):
    """Split ratings into a training set and a test set.

    Roughly (1 - train_rate) of each user's ratings go to the test
    set, preferring movies not yet in the test set so that every
    movie in the test set also appears in the training set.
    """
    grps = ratings.groupby('user_id').groups
    test_df_index = list()
    train_df_index = list()

    test_iid = list()

    for key in grps:
        count = 0
        local_index = list()
        grp = np.array(list(grps[key]))

        n_test = int(len(grp) * (1 - train_rate))

        # First pass: only take ratings whose movie is not already in
        # the test set, so each movie enters the test set at most once.
        for i, index in enumerate(grp):
            if count >= n_test:
                break
            if ratings.iloc[index]['movie_id'] in test_iid:
                continue
            test_iid.append(ratings.iloc[index]['movie_id'])
            test_df_index.append(index)
            local_index.append(i)
            count += 1

        grp = np.delete(grp, local_index)

        # Second pass: if this user's quota is not yet filled, take
        # the remaining ratings regardless of movie.
        if count < n_test:
            local_index = list()
            for i, index in enumerate(grp):
                if count >= n_test:
                    break
                test_iid.append(ratings.iloc[index]['movie_id'])
                test_df_index.append(index)
                local_index.append(i)
                count += 1

            grp = np.delete(grp, local_index)

        # Whatever is left of this user's ratings goes to training,
        # so every user keeps at least one training rating.
        train_df_index.append(grp)

    test_df_index = np.hstack(test_df_index)
    train_df_index = np.hstack(train_df_index)

    np.random.shuffle(test_df_index)
    np.random.shuffle(train_df_index)

    return ratings.iloc[train_df_index], ratings.iloc[test_df_index]

You can use this method to split the data. I've made an effort to ensure that every user ID and movie ID in the test set also appears in the training set.
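The per-user hold-out idea above can also be sketched much more compactly with pandas' `groupby(...).sample` (a simplified variant, pandas >= 1.1: it guarantees every test user appears in training, though not every movie):

```python
import pandas as pd

ratings = pd.DataFrame({'user_id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                        'movie_id': [10, 20, 30, 40, 50, 10, 20, 30, 40, 50],
                        'rating':  [4, 3, 5, 2, 1, 4, 3, 5, 2, 1]})

# hold out ~20% of each user's ratings; the rest of each user's rows
# stay in training, so every test user is guaranteed to appear there
test = ratings.groupby('user_id').sample(frac=0.2, random_state=0)
train = ratings.drop(test.index)
```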

MrJasonLi