I have the following Pandas statement that uses the apply method and can take up to 2 minutes to run.

I read that I should vectorize the statement to speed it up. My original statement looks like this:

output_data["on_s"] = output_data["m_ind"].apply(lambda x: my_matrix[x, 0] + my_matrix[x, 1] + my_matrix[x, 2])

Where my_matrix is a scipy.sparse matrix. So my initial step was to use the sum method:

summed_matrix = my_matrix.sum(axis=1)

But after this point I am stuck on how to proceed.

Update: Including example data

The matrix looks like this (scipy.sparse.csr_matrix):


(290730, 2)     0.3058016922838267
(290731, 2)     0.3390328430763723
(290733, 2)     0.0838999800585995
(290734, 2)     0.0237008960604337
(290735, 2)     0.0116864263235209

output_data["m_ind"] is just a Pandas Series that has some values like so:

97543
97544
97545
97546
97547
mp252
  • Please share a sample of `output_data` and `my_matrix`. Always provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) when asking for help, so people can understand clearly what you want and reproduce your problem. This should be useful: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Rodalm Jun 13 '22 at 22:44
  • @Rodalm added data! – mp252 Jun 14 '22 at 10:53

1 Answer

Update: You have to convert the sparse matrix into a dense matrix first.

Since you haven't provided any reproducible code, I can't tell exactly what your problem is or give you a very precise answer, so I will answer according to my understanding. Let's assume your my_matrix is something like this:

[[1,2,3],
 [4,5,6],
 [7,8,9]]

Then summed_matrix will be like [6,15,24]. So if my assumption is right, it looks like you are almost there.
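As a rough check of that assumption, here is a minimal sketch using the 3x3 example above as a toy stand-in for my_matrix. The name row_sums is introduced only for illustration. It also shows why a conversion is needed: on a scipy.sparse matrix such as csr_matrix, sum(axis=1) returns a dense 2-D numpy matrix of shape (n, 1) rather than a flat array.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for my_matrix, built from the 3x3 example above.
my_matrix = sparse.csr_matrix(np.array([[1, 2, 3],
                                        [4, 5, 6],
                                        [7, 8, 9]]))

# Summing along axis=1 gives one value per row, as a dense (3, 1) matrix.
summed_matrix = my_matrix.sum(axis=1)

# Flatten to a plain 1-D array so that indexing returns scalars.
# row_sums is a hypothetical name, not part of the original code.
row_sums = np.asarray(summed_matrix).ravel()
print(row_sums)  # the row sums 6, 15, 24
```

Note that this sums every column. If the real matrix has more than three columns, the original lambda only adds columns 0 to 2, so the sum would need to be restricted accordingly.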

First, I'll give you the simplest answer. Try using this line:

output_data["on_s"] = output_data["m_ind"].apply(lambda x: summed_matrix[x])

Then we can try this completely vectorized method.

  1. Turn m_ind into a one-hot encoded array, ohe_array. Be careful to generate ohe_array in increasing (sorted) order of the indices; otherwise you will get the wrong answer.
  2. Then take the dot product of ohe_array and summed_matrix.
  3. Assign the result to the column on_s.
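The three steps above could be sketched as follows. This is only an illustration under assumptions: row_sums stands for the flattened row sums of the 3x3 example, and the m_ind values are made-up sample data.

```python
import numpy as np
import pandas as pd

# Assumed inputs: flattened row sums of the 3x3 example and a
# hypothetical m_ind column.
row_sums = np.array([6, 15, 24])
output_data = pd.DataFrame({"m_ind": [2, 0, 1, 2]})

indexes = output_data["m_ind"].to_numpy()

# Step 1: one row per entry of m_ind, with a single 1 in column m_ind[i].
# Columns follow the increasing order of the matrix row indices 0..n-1.
ohe_array = np.zeros((len(indexes), len(row_sums)), dtype=row_sums.dtype)
ohe_array[np.arange(len(indexes)), indexes] = 1

# Step 2: the dot product selects row_sums[m_ind[i]] for each row i.
# Step 3: assign the result to the on_s column.
output_data["on_s"] = ohe_array @ row_sums
```

A dense one-hot array has len(m_ind) times n entries, so for a matrix with hundreds of thousands of rows it may be too large to be practical; it is worth measuring before adopting it.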

We can also try the following code and compare its performance against apply.

# Look up the precomputed row sum for each index in m_ind.
indexes = output_data["m_ind"].values
on_s = [summed_matrix[i] for i in indexes]

# Store the result in the on_s column.
output_data["on_s"] = on_s
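Another option worth timing, which is not one of the methods described above, is plain NumPy integer-array indexing on the flattened row sums. The sketch below uses toy data and assumes every value in m_ind is a valid row index of my_matrix; row_sums is again a name introduced only for illustration.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for my_matrix and output_data (hypothetical sample data).
my_matrix = sparse.csr_matrix(np.array([[1.0, 2.0, 3.0],
                                        [4.0, 5.0, 6.0],
                                        [7.0, 8.0, 9.0]]))
output_data = pd.DataFrame({"m_ind": [2, 0, 1, 2]})

# Sum only columns 0..2, matching the original lambda, then flatten the
# dense (n, 1) result into a 1-D array.
row_sums = np.asarray(my_matrix[:, :3].sum(axis=1)).ravel()

# Index the row sums with the whole m_ind column in a single operation.
output_data["on_s"] = row_sums[output_data["m_ind"].to_numpy()]
```

This avoids calling a Python function per row, which is usually where apply spends its time.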
Kavindu Ravishka