
I'm using CreateML to generate a Recommender model using an implicit dataset of the format: User ID, Item ID. The data is loaded into CreateML as a CSV with about 400k rows.

When attempting to 'Train' the model, I receive the following error:

Training Error: Item IDs in the recommender model must be numbered 0, 1, ..., num_items - 1

My dataset is in the following format:

"user_id","item_id"
"e7ca1b039bca4f81a33b21acc202df24","f7267c60-6185-11ea-b8dd-0657986dc989"
"1cd4285b19424a94b33ad6637ec1abb2","e643af62-6185-11ea-9d27-0657986dc989"
"1cd4285b19424a94b33ad6637ec1abb2","f2fd13ce-6185-11ea-b210-0657986dc989"
"1cd4285b19424a94b33ad6637ec1abb2","e95864ae-6185-11ea-a254-0657986dc989"
"31042cbfd30c42feb693569c7a2d3f0a","e513a2dc-6185-11ea-9b4c-0657986dc989"
"39e95dbb21854534958d53a0df33cbf2","f27f62c6-6185-11ea-b14c-0657986dc989"
"5c26ca2918264a6bbcffc37de5079f6f","ec080d6c-6185-11ea-a6ca-0657986dc989"

I've tried modifying both Item ID and User ID to enumerated IDs, but I still receive the training error. Example:

"item_ids","user_ids"
0,0
1,0
2,0
2,0
0,225
400,225
409,225
0,282
0,4
8,4
8,4

I receive this error both within the CreateML UI and when using CreateML within a Swift playground. I've also tried removing duplicates and verified that the maximum ID for each column is (num_items - 1).
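
For reference, the enumeration and the "maximum ID is (num_items - 1)" check described above can be sketched in plain Python as follows. This is only an illustration: the shortened IDs are made-up sample data, and `enumerate_ids` and `is_contiguous` are hypothetical helper names. As stated above, passing this check did not make the training error go away.

```python
def enumerate_ids(values):
    """Map each distinct value to 0, 1, ..., n-1 in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def is_contiguous(ids):
    """Return True when the distinct ids are exactly {0, 1, ..., n-1}."""
    distinct = set(ids)
    return distinct == set(range(len(distinct)))


# Hypothetical sample rows in (user_id, item_id) order.
rows = [
    ("e7ca1b03", "f7267c60"),
    ("1cd4285b", "e643af62"),
    ("1cd4285b", "f2fd13ce"),
    ("31042cbf", "e643af62"),
]

user_map = enumerate_ids(user for user, _ in rows)
item_map = enumerate_ids(item for _, item in rows)
encoded = [(user_map[user], item_map[item]) for user, item in rows]

# Both columns should now use every integer from 0 to n-1.
users_ok = is_contiguous(user for user, _ in encoded)
items_ok = is_contiguous(item for _, item in encoded)
```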

I've searched for documentation on what the exact requirement is for the set of IDs with no luck.

Thank you in advance for any help clarifying this error message.

mpmontanez

2 Answers


I was able to discuss this issue with Apple's CoreML developers during WWDC 2020. They described this as a known bug which will be fixed in the upcoming OS (Big Sur). The workaround for this bug is:

In the CSV dataset, create records for a single user that interacts with ALL items, and create records for a single item that ALL users interact with.

Using pandas in Python, I essentially implemented the following:

import csv

import pandas as pd

# ratings_df is assumed to be an already-loaded DataFrame
# with 'user_id' and 'item_id' columns.

# Find the unique item ids
item_ids = ratings_df.item_id.unique()

# Find the unique user ids
user_ids = ratings_df.user_id.unique()

# Create a 'dummy user' which interacts with all items
# (pd.concat is used because DataFrame.append is not available in pandas 2.0+)
mock_user_interactions_df = pd.DataFrame({'item_id': item_ids, 'user_id': 'mock-user'})
ratings_with_mocks_df = pd.concat([ratings_df, mock_user_interactions_df])

# Create a 'dummy item' which every user interacts with
mock_item_interactions_df = pd.DataFrame({'item_id': 'mock-item', 'user_id': user_ids})
ratings_with_mocks_df = pd.concat([ratings_with_mocks_df, mock_item_interactions_df])

# Export the CSV
ratings_with_mocks_df.to_csv('data/ratings-w-mocks.csv', quoting=csv.QUOTE_NONNUMERIC, index=True)

Using this CSV, I successfully generated a CoreML model using CreateML.

mpmontanez
  • I got this working with your example `pandas` workflow, but have you tried adding a rating column for MLRecommender training? Whenever I add this column I get the same error back; without the rating column the model trains without issues. – Paweł Madej Jul 05 '20 at 17:03
  • Yes, I only tried it with ratings of 0 and 1 and found that all of the 'mock' ratings had to be 1. If they were 0, I would get the original error. – mpmontanez Jul 06 '20 at 23:05
  • Hmm, maybe this is the problem I hit recently. Thanks a lot for this hint. I will check it today. – Paweł Madej Jul 08 '20 at 04:19
  • It works ... for the first time normalisation of data + rating > 0 yay :) – Paweł Madej Jul 08 '20 at 13:37
  • I have written a short post about this issue. I hope it will help others: https://www.pawelmadej.com/post/mlrecommender-in-practice/ – Paweł Madej Jul 08 '20 at 16:52
  • Great! Glad it worked and nice post, Pawel. Thanks for the shout out. – mpmontanez Jul 09 '20 at 05:29
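
Following the comment thread above, here is a minimal plain-Python sketch of the same workaround when the dataset also has a rating column. It assumes each record is a dict with `user_id`, `item_id` and `rating` keys; `add_mock_rows` and its parameters are hypothetical names chosen for illustration. The only point it encodes from the comments is that every mock row gets a rating of 1, since mock ratings of 0 reportedly brought the original error back.

```python
def add_mock_rows(rows, mock_user="mock-user", mock_item="mock-item", mock_rating=1):
    """Return rows plus one mock user rating every item and one mock item rated by every user."""
    items = sorted({row["item_id"] for row in rows})
    users = sorted({row["user_id"] for row in rows})

    # The mock user interacts with ALL real items.
    mocks = [
        {"user_id": mock_user, "item_id": item, "rating": mock_rating}
        for item in items
    ]
    # ALL real users interact with the mock item.
    mocks += [
        {"user_id": user, "item_id": mock_item, "rating": mock_rating}
        for user in users
    ]
    return rows + mocks


# Hypothetical sample data: 2 users, 3 items.
ratings = [
    {"user_id": "a", "item_id": "x", "rating": 1},
    {"user_id": "a", "item_id": "y", "rating": 0},
    {"user_id": "b", "item_id": "z", "rating": 1},
]
ratings_with_mocks = add_mock_rows(ratings)
```

The original list is left unmodified, so the mock rows can be dropped again simply by going back to `ratings`.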

Try adding an unnamed first column to your CSV data which counts rows from 0 ... number of items - 1

For example:

"","userID","itemID","rating"
0,"a","x",1
1,"a","y",0
...

I think it started working for me today after adding this column. I use UUIDs for userID and itemID in my training model. Also, be sure to sort the rows by itemID so that all rows for one itemID are next to each other.
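
A minimal sketch of that layout, using only Python's standard `csv` module: sort the rows by itemID, then write them with an unnamed leading index column. The function name `to_indexed_csv` and the sample rows are hypothetical, and note from the comments below that this approach did not resolve the error for the original poster.

```python
import csv
import io


def to_indexed_csv(rows):
    """Sort rows by itemID and emit CSV text with an unnamed 0-based index column."""
    ordered = sorted(rows, key=lambda row: row["itemID"])
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC quotes strings (including the empty header cell) but not numbers.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(["", "userID", "itemID", "rating"])
    for index, row in enumerate(ordered):
        writer.writerow([index, row["userID"], row["itemID"], row["rating"]])
    return buffer.getvalue()


# Hypothetical sample rows, deliberately out of itemID order.
sample = [
    {"userID": "a", "itemID": "y", "rating": 0},
    {"userID": "a", "itemID": "x", "rating": 1},
]
csv_text = to_indexed_csv(sample)
```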

Paweł Madej
  • Thanks for the suggestion Pawel. I tried creating an index column, using UUIDs, and sorting by the item ID, but I'm still receiving the same error posted above. Are your item IDs unique across the entire dataset? I'm assuming that user_id to item_id should be a many:many relationship (multiple users can rate a single quote, a single user can rate multiple quotes), but I'm wondering if this assumption is incorrect. – mpmontanez Jun 19 '20 at 18:55
  • I have a many:many relation in this data set. I prepared a script to generate a test model for this purpose: https://gist.github.com/nysander/03609236c22c935bb25a91a9a6afd20e – Paweł Madej Jun 19 '20 at 21:29
  • Thanks Pawel. I discussed the issue with Apple's engineers and posted the answer. – mpmontanez Jun 23 '20 at 22:56