6

I have a big sparse matrix. I want to take log base 4 of every element in it.

I tried numpy.log(), but it doesn't work on sparse matrices.

I can also take the logarithm row by row, overwriting each old row with the new one:

# Assume A is a sparse matrix (Linked List Format) with float values as data
# It is only for one row

import numpy as np
c = np.log(A.getrow(0)) / np.log(4)
A[0, :] = c

This was not as quick as I'd expected. Is there a faster way to do this?

Baskaya

2 Answers

10

You can modify the data attribute directly:

>>> import numpy as np
>>> from scipy.sparse import coo_matrix
>>> a = np.array([[5,0,0,0,0,0,0],[0,0,0,0,2,0,0]])
>>> coo = coo_matrix(a)
>>> coo.data
array([5, 2])
>>> coo.data = np.log(coo.data)
>>> coo.data
array([ 1.60943791,  0.69314718])
>>> coo.todense()
matrix([[ 1.60943791,  0.        ,  0.        ,  0.        ,  0.        ,
          0.        ,  0.        ],
        [ 0.        ,  0.        ,  0.        ,  0.        ,  0.69314718,
          0.        ,  0.        ]])

Note that this doesn't work properly if the sparse format has repeated elements (which is valid in the COO format); it'll take the logs individually, and log(a) + log(b) != log(a + b). You probably want to convert to CSR or CSC first (which is fast) to avoid this problem.
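To make the duplicate-entry caveat concrete, here is a small sketch (using SciPy's `coo_matrix` and its `tocsr()` conversion) showing that logging `.data` directly logs each duplicate separately, while converting to CSR first sums the duplicates so the log is taken of the true stored value:

```python
import numpy as np
from scipy.sparse import coo_matrix

# Two duplicate entries at position (0, 0) -- legal in COO format.
row = np.array([0, 0])
col = np.array([0, 0])
data = np.array([1.0, 2.0])
dup = coo_matrix((data, (row, col)), shape=(1, 1))

# Logging .data directly logs each duplicate on its own; when the matrix
# is later densified the logs get summed: log(1) + log(2) = log(2).
wrong = dup.copy()
wrong.data = np.log(wrong.data)

# Converting to CSR sums duplicates first, so we take log(1 + 2) = log(3).
right = dup.tocsr()
right.data = np.log(right.data)

print(wrong.toarray()[0, 0])  # ~0.693  (log 2)
print(right.toarray()[0, 0])  # ~1.099  (log 3)
```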

You'll also have to add checks if the sparse matrix is in a different format, of course. And if you don't want to modify the matrix in-place, just construct a new sparse matrix as you did in your answer, but without adding 3 because that's completely unnecessary here.
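Putting the pieces together, a minimal sketch of the whole log-base-4 operation (assuming all stored values are positive, starting from a LIL matrix as in the question): convert to CSR, then apply the log to `.data` in one vectorized step.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Build a small LIL matrix as a stand-in for the question's A.
A = lil_matrix((2, 7))
A[0, 0] = 5.0
A[1, 4] = 2.0

B = A.tocsr()                        # fast conversion; also sums any duplicates
B.data = np.log(B.data) / np.log(4)  # log base 4, applied only to nonzeros
```

The zero entries are never touched, which is exactly what makes this faster than the row-by-row approach.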

Danica
  • What are the differences between my solution and your solution? You propose only that 3 isn't necessary which has already proposed as a comment by you in my solution. – Baskaya Mar 24 '12 at 19:53
  • @Thorn I actually initially misread your solution (thought you were adding 3 to the entire matrix and so doing a whole lot of unnecessary logarithms). You're right that they're basically the same. – Danica Mar 24 '12 at 20:42
  • Grateful for this answer as the example makes it very clear. Even if it is essentially the same answer, it's good to have it here. – P i Apr 21 '21 at 06:41
0

I think I solved it in a very easy way. It is strange that no one answered immediately.

# Let A be a COO_matrix
import numpy as np
from scipy.sparse import coo_matrix
new_data = np.log(A.data + 3) / np.log(4)  # the 3 is not so important; it can be 1 too
A = coo_matrix((new_data, (A.row, A.col)), shape=A.shape)
Baskaya
    No-one suggested this solution because it is mathematically incorrect. `log(x)` could be very different from `log(x+1)`! (example: `log(0.000001) = -6`, but `log(0.000001 + 1)` = 0 and a bit). – Li-aung Yip Mar 23 '12 at 02:02
  • Sorry for the ill-posed question. I didn't mention that all the data are positive and bigger than 1. These are the values of TF (term frequency) matrices. I think there will be no problem. – Baskaya Mar 23 '12 at 10:29
  • There's absolutely no reason to add 3 (or anything) here, since none of the entries in `A.data` will be 0. But if you do want to take the approach of adding a constant, use a smaller one! Adding say `1e-16` will have the same effect of never taking `log(0)` but with much less error introduced: using the [appropriate identity](http://en.wikipedia.org/wiki/List_of_logarithmic_identities#Summation.2Fsubtraction), it's `log(x + eps) = log(x) + log(1 + eps/a)`, where the error introduced is near 0 if `eps/a` is almost 0. – Danica Mar 23 '12 at 14:37
  • @Dougal Thank you, but I want to add `3` because I don't want these 1's to become zero after the logarithm. It is a design concern of term matrices and it is related to `text processing`. I do not want to add a very small value because, for me, 1s are more important than you expect. Besides, it doesn't change much. – Baskaya Mar 24 '12 at 19:24
  • @Thorn I'm not sure exactly what you're doing with the logarithms here, but if you're doing any kind of NLP algorithm that uses the log, it's going to be doing the wrong thing if you're not actually giving it the log. In this case, it'll probably end up overweighting things with only 1 observed count. If the problem is simply that you want to distinguish between when the original entry had a 1 and when it didn't exist, you might want to think about maintaining a list of entries yourself and throwing them in a COO sparse matrix when you want to do matrix operations. – Danica Mar 24 '12 at 20:47
  • @Thorn If you actually want to overcount things with a value of 1, you probably want to use a more justified approach to smoothing the data: e.g. using [pseudocounts](http://en.wikipedia.org/wiki/Pseudocount) obtained through an algorithm like [Good-Turing](http://en.wikipedia.org/wiki/Good-Turing_frequency_estimation), which is [implemented in nltk](http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.GoodTuringProbDist-class.html). – Danica Mar 24 '12 at 20:49
  • @Dougal Thanks for your advice. I will check it. As I said before, there won't be a problem when you add `3` to every nonzero value. – Baskaya Mar 24 '12 at 22:58
  • Let me explain my job: I want to build a tf-idf matrix which contains word counts of a Wikipedia article, and I saw that some words, such as the article's title, can be a problem. For example: in _The Matrix_ movie article, even though the word _matrix_ occurs 112 times and the word _dream_ occurs 4 times, it is not the case that _matrix_ is 28 times more important than _dream_. I want to eliminate these huge gaps, and I picked log as a basic solution. If I take the log without adding, then I lose my single occurrences. This simple approach, adding, is also valid for the idf part of the tf-idf calculation. – Baskaya Mar 24 '12 at 22:58
  • @Dougal I illustrate why I add 3 but actually this is a simple version of my story but I presume that you get the idea. – Baskaya Mar 24 '12 at 23:00
  • @Thorn Oh, okay -- if you're just using the log completely arbitrarily anyway, then sure, it really doesn't matter if you just add three. There are definitely more mathematically justified ways to deal with this problem, though. I'm not sure how you're using these log(TF-IDF) matrices for inference, but you could consider adding in a Bayesian prior that the name of an article is going to be used more often in the article, for example. – Danica Mar 25 '12 at 02:58