I have a pandas dataframe with about 3000 columns. The first column lists a category (the values can be repeated).

The second column through the last column contain 1s and 0s (it's essentially an indicator matrix). There are 20 or fewer 1s per row, so I am dealing with a sparse matrix.

I want to create a dictionary such that, when given a particular category, it returns a matrix of the pairwise cosine distances between all the indicator vectors in that category (with the row order from the data frame preserved). My data has about 100,000 rows as well, so I'm looking for an efficient way to do this.
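To make the target concrete, here is a minimal sketch of the per-category computation, assuming the rows of one category are already available as a 0/1 matrix. The function name `cosine_distance_matrix` is hypothetical, and mapping all-zero rows to a distance of 1 is my own assumption. It scales each row to unit length and takes one sparse product, so no Python-level loop over rows is needed.

```python
import numpy as np
from scipy import sparse


def cosine_distance_matrix(indicators):
    """Pairwise cosine distances between the rows of a 0/1 matrix.

    Hypothetical helper: returns a dense (n_rows, n_rows) array.
    """
    X = sparse.csr_matrix(indicators, dtype=np.float64)
    # Euclidean norm of every row, computed without densifying X.
    norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())
    # Assumption: an all-zero row has no direction, so avoid dividing by
    # zero; such a row ends up at distance 1 from every row, itself included.
    norms[norms == 0] = 1.0
    # Scale every row to unit length by multiplying with a diagonal matrix.
    X_unit = sparse.diags(1.0 / norms) @ X
    # Dot products of unit vectors are cosine similarities.
    similarity = (X_unit @ X_unit.T).toarray()
    return 1.0 - similarity


example = cosine_distance_matrix([[1, 1, 0], [1, 0, 1], [1, 1, 0]])
```

The result is dense with one row and one column per input row, so this only fits in memory if each category is reasonably small.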

Thanks

Green
  • I tried iterating over the rows with loops, building matrices by appending rows one below another, and then using scipy.spatial.distance after making each matrix sparse. This ended up being pretty slow though. – Green Jul 14 '16 at 20:28
  • Does [this post](http://stackoverflow.com/questions/17627219/whats-the-fastest-way-in-python-to-calculate-cosine-similarity-given-sparse-mat) do what you want? – Jarad Jul 14 '16 at 21:32
  • That's part of it, but my question is also how to efficiently go from the dataframe to a dictionary with the category as the key and the matrix as the value. If I can get the indicator matrix (split by category) stored as values in the dictionary, then I can implement that post to find the cosine similarity, but I need help getting to that point first. – Green Jul 14 '16 at 21:35
  • Did you try using `groupby(category).apply(function_to_generate_cos_distances)`? That would make your life much easier. If you find difficulty using it, post a simple dataframe object with 10 rows x 10 columns and tell what is the output you are expecting, so someone can suggest an efficient way to do it. At least posting your current code would help. – Sreyantha Chary Jul 15 '16 at 07:58
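The split discussed in the comments above (category as key, that category's indicator rows as value) might be sketched as follows. This assumes the category is the first column and every other column is 0/1. The name `indicator_matrices_by_category` is hypothetical. It relies on the `indices` attribute of a pandas groupby, which maps each group key to the integer positions of its rows.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def indicator_matrices_by_category(df):
    """Map each category to a sparse matrix of its indicator rows.

    Hypothetical helper. Assumes column 0 holds the category and the
    remaining columns hold 0/1 indicators.
    """
    category_col = df.columns[0]
    # Convert the indicator block to CSR once; row slicing is then cheap.
    values = sparse.csr_matrix(df.iloc[:, 1:].to_numpy())
    # Integer row positions for every category.
    positions = df.groupby(category_col).indices
    # Sorting the positions keeps the original data frame order per category.
    return {cat: values[np.sort(rows)] for cat, rows in positions.items()}


demo = pd.DataFrame({"cat": ["a", "b", "a"], "x": [1, 0, 0], "y": [0, 1, 1]})
matrices = indicator_matrices_by_category(demo)
```

Each value is a sparse matrix, so the cosine computation from the linked post can be applied to it directly. Note that `to_numpy()` may create a dense copy of the indicator block, which could be a concern at 100,000 x 3000.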

0 Answers