
I need to create a sort of similarity matrix based on user_id values. I currently use Pandas to store most of my data, but I know that iterating over DataFrames is an anti-pattern, so I am considering a nested set/dictionary structure to store the similarities, similar to some of the structures proposed here.

I would only be storing N nearest similarities, so it would amount to something like this:

{
    'user_1': {'user_2': 0.5, 'user_4': 0.9, 'user_3': 1.0},
    'user_2': ...,
}

This would let me access a neighbourhood quite easily via dict_name[user_id].

Essentially, the outermost dictionary key would be a user_id that returns another dictionary of its N closest neighbours, stored as user_id: similarity key-value pairs.
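To make the layout concrete, here is a minimal sketch of that structure with the toy values from above (the user IDs and similarities are just illustrative data):

```python
# Nested-dict similarity store: each user_id maps to its N nearest
# neighbours as {neighbour_id: similarity} pairs.
similarities = {
    'user_1': {'user_2': 0.5, 'user_4': 0.9, 'user_3': 1.0},
    'user_2': {'user_1': 0.5, 'user_3': 0.7, 'user_4': 0.2},
}

# O(1) neighbourhood lookup by user_id.
neighbourhood = similarities['user_1']

# Most similar neighbour within that neighbourhood.
closest = max(neighbourhood, key=neighbourhood.get)
print(closest)  # user_3 (similarity 1.0)
```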

For more context, I'm writing a simple KNN recommender. I am doing it from scratch because I've tried Surpriselib and sklearn, but they don't offer the context-aware flexibility I require.

This seems like a reasonable way to store these values to me, but is it anti-pythonic? Should I be using some other structure (e.g. NumPy, Pandas, or something else I don't yet know about)?

Alex K
    The answer depends a bit on the sizes of the dicts and how you plan to use it. There's nothing wrong with nested dicts as such. I would probably be reaching for something like [KDTree](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html) though... – chthonicdaemon Apr 14 '21 at 12:38
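The KDTree suggestion in the comment could look like the following sketch. The user vectors here are made-up toy data; `k=2` is used because the query point itself is always returned as the nearest match with distance 0:

```python
import numpy as np
from scipy.spatial import KDTree

# Toy data: one feature vector per user, rows aligned with user_ids.
user_ids = ['user_1', 'user_2', 'user_3', 'user_4']
vectors = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.1, 0.1],
    [5.0, 5.0],
])

tree = KDTree(vectors)

# k=2 returns the query point itself (distance 0) plus its nearest neighbour.
distances, indices = tree.query(vectors[0], k=2)
nearest = user_ids[indices[1]]  # skip index 0, the query point itself
print(nearest)  # user_3
```

Note that KDTree works with distance metrics (smaller = closer) rather than similarity scores (larger = closer), so the similarity measure would need to be expressible as a distance for this to apply directly.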

1 Answer


As the comment says, there is nothing inherently wrong or anti-pythonic with using (one level of) nested dictionaries and writing everything from scratch.

Performance-wise, you can probably beat your hand-written solution by using an existing data structure whose API matches the transformations/operations you intend to perform. NumPy/Pandas will only help if your operations can be expressed as vectorized operations that act on all (pairs of) elements along a common axis, e.g. all users in your top-level dictionary.
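As an illustration of what such a vectorized formulation might look like, here is a sketch that computes all pairwise cosine similarities at once and extracts the top-N neighbours per user (the feature vectors and N are assumptions for the example):

```python
import numpy as np

# Assumed setup: each row of `vectors` is one user's feature vector.
vectors = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [0.9, 0.1],
])
n_neighbours = 2

# Cosine similarity of all pairs in one vectorized step.
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms
sim = unit @ unit.T

# Exclude self-similarity, then take the N most similar users per row.
np.fill_diagonal(sim, -np.inf)
top_n = np.argsort(sim, axis=1)[:, ::-1][:, :n_neighbours]
print(top_n[0])  # row indices of user 0's two nearest neighbours
```

For large numbers of users, `np.argpartition` would avoid the full per-row sort, at the cost of the neighbours not coming back in sorted order.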

ojdo
  • Hmm, that sounds fair. My problem is that my similarity metric is based on two separate vectors that get combined linearly. I've found it difficult to find an API or module that lets me do this, though. – Alex K Apr 14 '21 at 14:06
  • Add some code showing how you would do that in your hand-written data model (possibly in a follow-up question); then one could estimate how a similar operation could be performed in package X. – ojdo Apr 14 '21 at 14:45
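As a hedged sketch of how the linear combination mentioned in the comments could stay fully vectorized (the two similarity matrices and the weight `alpha` are assumptions, not taken from the original exchange):

```python
import numpy as np

# Hypothetical setup: two separate pairwise similarity matrices that are
# combined linearly, as described in the comment above.
sim_a = np.array([[1.0, 0.2],
                  [0.2, 1.0]])
sim_b = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
alpha = 0.3  # assumed mixing weight between the two similarity sources

# Vectorized linear combination: one expression, no per-user loops.
combined = alpha * sim_a + (1 - alpha) * sim_b
print(combined[0, 1])  # 0.3*0.2 + 0.7*0.8 ≈ 0.62
```

The resulting `combined` matrix can then feed directly into a top-N selection step, so the custom metric alone is no reason to abandon a vectorized approach.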