How to create a sparse binary matrix from a dictionary in Python

Question

I have a .tsv file from which I've created a Python dictionary where the keys are the movie_id values and the values are lists of features (every movie has a different number of features).

Here's an example of my dictionary:

Goal to achieve:

From this dictionary I want to create an item-features sparse matrix to use for a recommender system project. At the end I would like to have a binary sparse matrix with 1 when a movie has a certain feature. Something like this:

My code:

To create the dictionary:

def Dictionary():
    d = {}
    with open(filepath_mapping) as f:
        for line in f:
            line = line.split()
            key = int(line[0])
            value = [int(el) for el in line[1:]]
            d[key] = value
    return d

movie_features_dict = Dictionary()

To create the item-features matrix from the dictionary:

import numpy as np
import scipy.sparse as sp

n = len(movie_features_dict)
value_lengths = [len(v) for v in movie_features_dict.values()]
d = max(value_lengths)
print(f"ITEM*FEATURES matrix shape: {n,d}\n")

item_feature_matrix = sp.dok_matrix((n,d), dtype=np.int8)

for movie_ids, features in movie_features_dict.items():
    item_feature_matrix[movie_ids, features] = 1

item_feature_matrix = item_feature_matrix.tocsr()
print(item_feature_matrix.shape)

Issues:

I have 22069 movies, and the movie with the most features should have 885 of them, so in theory I should get a 22069*885 matrix. However, with the code I've written I keep getting this error:

raise IndexError('index (%d) out of range' % max_indx)
IndexError: index (614734) out of range

Without the data it is a bit difficult to completely reproduce the error you're getting. What is the result of the first print statement, which shows the values of `n` and `d`? I assume the error occurs because you are indexing the matrix with the feature values (`features`), which can be higher than the total number of features since some feature values are not present (i.e. numbers 2 through 4 in your example). – Oxbowerce, Nov 01 '22 at 17:13
@Oxbowerce the result of the first print statement is "ITEM*FEATURES matrix shape: (22069, 885)". So, as I wrote in the 'Issues' paragraph of my question, 'n' (the number of movies) should be 22069, while 'd' (the largest number of features that any single movie has) should be 885. I think the problem is that the number of features per movie is not fixed but variable, and I don't know how to create the sparse matrix. – Pybubb, Nov 01 '22 at 17:42
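
A minimal sketch of the fix the first comment points toward, assuming the feature values are arbitrary integer IDs rather than column positions: map every movie ID and every feature ID to a contiguous index first, then build the matrix from coordinate lists. The toy dictionary and the helper name `build_item_feature_matrix` are hypothetical and only serve as illustration.

```python
import numpy as np
import scipy.sparse as sp


def build_item_feature_matrix(movie_features):
    """Return (csr_matrix, movie_ids, feature_ids) for a dict of id -> feature list."""
    # Row i corresponds to movie_ids[i]; column j corresponds to feature_ids[j].
    movie_ids = sorted(movie_features)
    feature_ids = sorted({f for feats in movie_features.values() for f in feats})
    col_of = {f: j for j, f in enumerate(feature_ids)}

    rows, cols = [], []
    for i, movie_id in enumerate(movie_ids):
        # set() drops repeated features so every stored entry is exactly 1
        for f in set(movie_features[movie_id]):
            rows.append(i)
            cols.append(col_of[f])

    data = np.ones(len(rows), dtype=np.int8)
    matrix = sp.csr_matrix(
        (data, (rows, cols)),
        shape=(len(movie_ids), len(feature_ids)),
    )
    return matrix, movie_ids, feature_ids


# Toy data (hypothetical): feature IDs are far larger than the number of features
toy = {880: [18, 23, 854, 98475, 20], 152: [1, 578, 18], 6654: [2088]}
matrix, movie_ids, feature_ids = build_item_feature_matrix(toy)
print(matrix.shape)  # (3, 8)
```

In this sketch the number of columns is the number of distinct feature IDs, not the length of the longest feature list. A 22069*885 matrix cannot be indexed with a raw value such as 614734, whether that value is a movie ID or a feature ID, which is consistent with the IndexError in the question.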

blunova · Answer 1 · 2022-11-01T18:33:33.557

1

Based on this answer, you can do the following with a few lines of code:

import pandas as pd

id_to_features = {
    880: [18, 23, 854, 98475, 20],
    152: [1, 578, 18, 654, 23, 5, 11],
    6654: [2088]
}

df = pd.DataFrame({"features": list(id_to_features.values())})
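# The top-level pd.value_counts is deprecated in recent pandas; Series.value_counts does the same job here
matrix = df['features'].apply(lambda feats: pd.Series(feats).value_counts()).fillna(0).astype(int)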
ids = list(id_to_features.keys())
matrix.index = ids
matrix = matrix.reindex(sorted(matrix.columns), axis=1)

EDIT

Out of curiosity, I created a fake dataset, and the code above took 7 seconds to run (measured with time.perf_counter) on an ordinary laptop.

Here is the code for generating the dataset:

from random import randint

id_to_features = {
    i: [randint(1, 886) for _ in range(randint(1, 10))] for i in range(1, 22070)
}

The resulting matrix requires 78 MB of memory, computed using:

matrix.memory_usage(index=True, deep=True).sum()

With astype("int8") instead, it requires 20 MB.
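
For comparison, here is a hedged sketch of how a similar estimate could be made for a SciPy CSR matrix, whose footprint is roughly the sum of its three underlying arrays. The random matrix below is made up for illustration (it is not the fake dataset above), the 0.5% density is an assumption, and the helper name `csr_nbytes` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp

n_items, n_features = 22069, 886  # sizes assumed to mirror the fake dataset

# Random matrix with about 0.5% non-zero entries (illustrative only)
sparse = sp.random(n_items, n_features, density=0.005, format="csr", random_state=0)
sparse.data[:] = 1                # make every stored entry a 1
sparse = sparse.astype(np.int8)


def csr_nbytes(m):
    """Bytes held by the three arrays that make up a CSR matrix."""
    return m.data.nbytes + m.indices.nbytes + m.indptr.nbytes


# A dense int8 array of the same shape needs one byte per cell
dense_nbytes = n_items * n_features * np.dtype(np.int8).itemsize
print(f"dense int8: {dense_nbytes / 1e6:.1f} MB")
print(f"CSR int8:   {csr_nbytes(sparse) / 1e6:.1f} MB")
```

The dense figure is simple arithmetic (22069 * 886 bytes, about 19.6 MB), while the CSR figure depends on how many entries are actually non-zero and on the index dtype SciPy chooses.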

edited Nov 01 '22 at 18:33

answered Nov 01 '22 at 17:16

blunova


The problem is that I can't (or don't want to) create a dataframe because the file is very big and it would take a lot of time. I read the .tsv file into a dictionary precisely because the file is really big, and the fact that every movie has a different number of features doesn't help me create the sparse matrix. – Pybubb Nov 01 '22 at 17:48
Hi! Thanks for the clarification. Out of curiosity I have created a dataset having the dimension specified by you. It took 7s. Is my fake dataset plausible? Thanks! – blunova Nov 01 '22 at 18:20

score 1 · Accepted Answer · answered Nov 08 '22 at 12:08

I'm writing this answer for future users who run into a similar problem.

As I said in the comments on the other answer above, creating a new pandas dataframe did not suit my needs, so this is the solution I implemented.

Based on this answer I've created the sparse matrix in this way:

from sklearn.feature_extraction import DictVectorizer

# One {feature_id: 1} dict per movie, in the dictionary's iteration order
restructured = []
for key in movie_features_dict:
    data_dict = {}
    for feat in movie_features_dict[key]:
        data_dict[feat] = 1
    restructured.append(data_dict)

dictvectorizer = DictVectorizer(sparse=True)
matrix_item_features = dictvectorizer.fit_transform(restructured)
print(f"Item-feature matrix shape: {matrix_item_features.shape}")

You can take a look here and here to get a better understanding of how DictVectorizer works.
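
As a small illustration of what DictVectorizer does with this kind of input, here is a sketch on a made-up three-movie dictionary (the toy values and the `movie_ids` list are assumptions, not part of the real data). Rows follow the order in which the dictionaries are passed in, and the fitted `feature_names_` attribute gives the original feature ID behind each column.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical toy input: movie_id -> list of feature IDs
movie_features_dict = {880: [18, 23, 20], 152: [1, 18], 6654: [2088]}

# Row i of the resulting matrix belongs to movie_ids[i]
movie_ids = list(movie_features_dict)
restructured = [dict.fromkeys(feats, 1) for feats in movie_features_dict.values()]

vectorizer = DictVectorizer(sparse=True)
matrix = vectorizer.fit_transform(restructured)

print(matrix.shape)               # (3, 5)
print(vectorizer.feature_names_)  # [1, 18, 20, 23, 2088]
print(matrix.toarray())           # dense view, only sensible for tiny examples
```

Keeping `movie_ids` and `feature_names_` around makes it possible to translate a (row, column) position in the sparse matrix back to a (movie, feature) pair later on.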