If you want to ensure that your training and test partitions don't contain the same pairings of user and item, then you could replace each unique (user, item) combination with an integer label and pass these labels to LabelKFold. To assign integer labels to each unique pairing you could use this trick:
import numpy as np
import pandas as pd
from sklearn.cross_validation import LabelKFold

df = pd.DataFrame({'users':   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

users_items = df[['users', 'items']].values

# View each (user, item) row as a single opaque void element, so that
# np.unique compares whole rows rather than individual values
d = np.dtype((np.void, users_items.dtype.itemsize * users_items.shape[1]))
_, uidx = np.unique(np.ascontiguousarray(users_items).view(d), return_inverse=True)

# uidx holds one integer label per row; identical (user, item) pairs share a label
for train, test in LabelKFold(uidx):
    # train your classifier using df.loc[train, ['users', 'items']] and
    # df.loc[train, 'ratings']...
    # cross-validate on df.loc[test, ['users', 'items']] and
    # df.loc[test, 'ratings']...
    pass
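As far as I know, newer scikit-learn releases dropped the sklearn.cross_validation module, and LabelKFold lives on as GroupKFold in sklearn.model_selection. Here is a rough sketch of the same idea with that API. Note that I've swapped the np.void view for pandas' groupby(...).ngroup() to build the labels; that substitution, and the names groups, X, y and folds, are my own choices for illustration rather than anything required by the approach above.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

df = pd.DataFrame({'users':   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

# One integer label per unique (user, item) pair. ngroup() is a pandas
# alternative to the np.void view trick, used here only for brevity.
groups = df.groupby(['users', 'items']).ngroup().to_numpy()

X = df[['users', 'items']].to_numpy()
y = df['ratings'].to_numpy()

# GroupKFold keeps all rows that share a label on the same side of each split
folds = list(GroupKFold(n_splits=3).split(X, y, groups=groups))
for train, test in folds:
    # No (user, item) label should appear in both train and test
    assert set(groups[train]).isdisjoint(groups[test])
```

Here train and test are positional indices, so with a non-default DataFrame index you would probably want df.iloc rather than df.loc to select the rows.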
I'm still having a hard time understanding your question. If you instead want to guarantee that your training and test sets both contain examples of the same users, then you could use StratifiedKFold:
from sklearn.cross_validation import StratifiedKFold
for train, test in StratifiedKFold(df['users']):
    # ...
    pass
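For completeness, here is how I'd expect the stratified version to look with the newer sklearn.model_selection API, where the class labels are passed to split() instead of the constructor. This is only a sketch on the toy data from above; the names X, users, all_users and splits are mine, and n_splits=3 is an arbitrary choice that happens to fit four rows per user.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

df = pd.DataFrame({'users':   [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'items':   [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
                   'ratings': [2, 4, 3, 1, 4, 3, 0, 0, 0, 1, 0, 1]})

X = df[['users', 'items']].to_numpy()
users = df['users'].to_numpy()
all_users = set(users)

# Stratify on the user column so each fold keeps roughly the same
# proportion of rows per user
splits = list(StratifiedKFold(n_splits=3).split(X, users))
for train, test in splits:
    # On this toy data every user shows up on both sides of each split
    assert set(users[train]) == all_users
    assert set(users[test]) == all_users
```

This only works if every user has at least as many rows as there are folds; users with fewer rows can't be represented in every test set.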