-1

I would like to calculate the log-ratios for my 2D array, e.g.

    a = np.array([[3,2,1,4], [2,1,1,6], [1,5,9,1], [7,8,2,2], [5,3,7,8]])

The formula is ln(x/g(x)), where g(x) is the geometric mean of the row that x belongs to. I currently compute it like this:

    logvalues = np.array(a) # the values will be overwritten through the code below.
    for i in range(len(a)):
        row = np.array(a[i])
        geo_mean = row.prod()**(1.0/len(row))
        flr = lambda x: math.log(x/geo_mean)
        logvalues = np.array([flr(x) for x in row])

I was wondering if there is any way to vectorise the above lines (preferably without introducing other modules) to make it more efficient?

Alex Waygood
  • At the end of your code you will end up with only the last row of `a`. I'm guessing this is not the intended behaviour; it can be fixed by changing the last line to `logvalues[i] = np.array([flr(x) for x in row])`. – yann ziselman Sep 15 '21 at 09:35
  • Sorry I missed the '[i]' behind logvalues. Thanks Yann for pointing it out! – JiaruiS Sep 15 '21 at 09:44

2 Answers

2

This should do the trick:

    geo_means = a.prod(1)**(1/a.shape[1])
    logvalues = np.log(a/geo_means[:, None])
yann ziselman
  • @Reti43, ty for the correction. i edited to fix my mistake – yann ziselman Sep 15 '21 at 09:38
  • Thank you! It works perfectly! May I ask what `[:, None]` means in `geo_means[:, None]`? – JiaruiS Sep 15 '21 at 09:59
  • @JiaruiS It adds a new singleton dimension at the end of the array. For example if `a.shape` = (2,3) then `a[:,None].shape` = (2,3,1) and `a[None,:].shape` = (1,2,3) – obchardon Sep 15 '21 at 10:32
  • @obchardon This isn't exactly correct. `a[:,None].shape` = (2, 1, 3). `a[...,None]` or `a[:,:,None]` will add a new axis at the end. For the OP, [this](https://stackoverflow.com/questions/29241056/how-does-numpy-newaxis-work-and-when-to-use-it) explains its use; as in most cases, the point is to take advantage of [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html). – Reti43 Sep 15 '21 at 10:44
  • @JiaruiS What obchardon said, plus: if you know what the method `reshape` does, this is the same as calling `reshape` with new dimensions of length 1 before, between, or after the existing dimensions. I like writing it this way because it's shorter. I am reshaping the vector `geo_means` into a column so that each row of the matrix `a` is divided by its own geometric mean. – yann ziselman Sep 15 '21 at 10:44
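The shapes discussed in these comments can be checked directly. The following is a minimal sketch that reuses the question's array `a`; the variable names `column`, `shape_mid`, and so on are only illustrative and do not come from the thread.

```python
import numpy as np

a = np.array([[3, 2, 1, 4], [2, 1, 1, 6], [1, 5, 9, 1], [7, 8, 2, 2], [5, 3, 7, 8]])

# One geometric mean per row: a 1-D vector of length 5.
geo_means = a.prod(axis=1) ** (1 / a.shape[1])

# Indexing with None inserts a length-1 axis at that position,
# turning the vector into a (5, 1) column.
column = geo_means[:, None]
shape_1d = geo_means.shape    # (5,)
shape_col = column.shape      # (5, 1)

# Equivalent spellings of the same reshaping.
same_as_reshape = np.array_equal(column, geo_means.reshape(-1, 1))
same_as_newaxis = np.array_equal(column, geo_means[:, np.newaxis])

# (5, 4) / (5, 1) broadcasts the column across the 4 columns of a,
# so every row is divided by its own geometric mean.
logvalues = np.log(a / column)

# On the 2-D array itself, the new axis lands where None appears.
shape_mid = a[:, None].shape     # (5, 1, 4)
shape_end = a[..., None].shape   # (5, 4, 1)
```

Without the extra axis, `a / geo_means` would try to align the length-5 vector with the last axis of `a` (length 4) and raise a broadcasting error.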
0

Another way to do this is to write the function as though for a single 1-D array, ignoring the 2-D aspect:

    def f(x):
        return np.log(x / x.prod()**(1.0 / len(x)))

Then if you want to apply it to all rows in a 2-D array (or N-D array):

    >>> np.apply_along_axis(f, 1, a)
    array([[ 0.30409883, -0.10136628, -0.79451346,  0.5917809 ],
           [ 0.07192052, -0.62122666, -0.62122666,  1.17053281],
           [-0.95166562,  0.65777229,  1.24555895, -0.95166562],
           [ 0.59299864,  0.72653003, -0.65976433, -0.65976433],
           [-0.07391256, -0.58473818,  0.26255968,  0.39609107]])
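As a cross-check on the output above, the same values can be obtained from the identity ln(x/g) = ln(x) - mean(ln(x)), which holds because the log of a geometric mean is the arithmetic mean of the logs. This log-domain form is a swapped-in alternative rather than something proposed in the thread, and the names `via_apply`, `via_logs`, and `row_sums` are illustrative only. It assumes all entries are strictly positive.

```python
import numpy as np

a = np.array([[3, 2, 1, 4], [2, 1, 1, 6], [1, 5, 9, 1], [7, 8, 2, 2], [5, 3, 7, 8]])

def f(x):
    # Log-ratio of one 1-D row against its geometric mean.
    return np.log(x / x.prod() ** (1.0 / len(x)))

# Row-by-row application, as in the answer.
via_apply = np.apply_along_axis(f, 1, a)

# Log-domain variant: subtract each row's mean log from its logs.
# keepdims=True keeps the means as a (5, 1) column so they broadcast.
log_a = np.log(a)
via_logs = log_a - log_a.mean(axis=1, keepdims=True)

agree = np.allclose(via_apply, via_logs)

# Each row of log-ratios sums to zero (up to rounding).
row_sums = via_logs.sum(axis=1)
```

The log-domain form never computes the row product, so it may be the safer choice for long rows or large values, where the product could overflow.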

Some other general notes on your attempt:

  1. `for i in range(len(a))`: If you want to loop over all rows of an array, it is generally faster to write simply `for row in a`. NumPy can optimize this case somewhat, whereas with `for idx in range(len(a))` you have to index the array again with `a[idx]` on every iteration, which is slower. Even then, it is better to avoid a `for` loop altogether where possible, as you already know.

  2. `row = np.array(a[i])`: The `np.array()` call isn't necessary. If you index a multi-dimensional array, the returned value is already an array.

  3. `lambda x: math.log(x/geo_mean)`: Don't use `math` functions with NumPy arrays; use the equivalents in the `numpy` module. Wrapping this in a function adds unnecessary overhead as well. Since you call it as `[flr(x) for x in row]`, it is equivalent to the already vectorized NumPy operation `np.log(row / geo_mean)`.
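Applying the three notes to the original loop might look like the sketch below. It also writes each result to `logvalues[i]`, as suggested in the comments on the question. The float output buffer is an added assumption: a plain copy of the integer array `a` would truncate the log values on assignment.

```python
import numpy as np

a = np.array([[3, 2, 1, 4], [2, 1, 1, 6], [1, 5, 9, 1], [7, 8, 2, 2], [5, 3, 7, 8]])

# Float output buffer; an integer buffer would silently truncate the logs.
logvalues = np.empty(a.shape, dtype=float)

# Iterate over the rows directly; enumerate supplies the target index.
for i, row in enumerate(a):
    geo_mean = row.prod() ** (1.0 / len(row))
    # np.log works on the whole row at once, so no lambda or math.log.
    logvalues[i] = np.log(row / geo_mean)
```

This keeps the explicit loop only for comparison; the fully vectorized forms avoid it entirely.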

Iguananaut