
Based on "Sort pandas DataFrame with function over column values":

I want to apply a function such as log() to a data frame inside the .assign() method, to create a temporary column that I can use as a sort key. However, I can't pass an axis parameter the way I can with the .apply() method.

Here's some sample code:

import pandas as pd
from numpy.random import randint, seed

seed(0)
df = pd.DataFrame({'value': [randint(1, 10) for i in range(10)], 'reading': [randint(1, 10) for i in range(10)]})

   value  reading
0      8        6
1      5        9
2      3        7
3      8        2
4      6        1
5      4        9
6      6        2
7      3        5
8      2        2
9      8        8

I can't use the .assign() method like this:

df.assign(log = log(df.value/df.reading))

    raise TypeError("cannot convert the series to " "{0}".format(str(converter)))
TypeError: cannot convert the series to <class 'float'>

or

df.assign(log = lambda x: log(x.value/x.reading))

    raise TypeError("cannot convert the series to " "{0}".format(str(converter)))
TypeError: cannot convert the series to <class 'float'>

But it works fine with the .apply() method:

df.apply(lambda x: log(x.value/x.reading), axis=1)

0    0.287682
1   -0.587787
2   -0.847298
3    1.386294
4    1.791759
5   -0.810930
6    1.098612
7   -0.510826
8    0.000000
9    0.000000
dtype: float64

Is there a workaround that lets me use assign, or a different way to create a temporary column for sorting?

Nate
Mehdi Zare
  • Where are you getting `log` from? It works for me with `np.log`. – mgilson Jan 01 '20 at 16:05
  • `from math import log` – Mehdi Zare Jan 01 '20 at 16:06
  • `math.log` is going to expect a scalar entity -- i.e. a single `float`. Use `numpy.log`, as that will work with anything that supports the array interface (including pandas Series). – mgilson Jan 01 '20 at 16:07
  • I also have some custom functions with the same issue; it's all about passing the axis=1 param. – Mehdi Zare Jan 01 '20 at 16:07
  • Thanks @mgilson, that solves part of the problem! – Mehdi Zare Jan 01 '20 at 16:09
  • Yes, `DataFrame.apply` is similar to a `map` operation -- `axis=1` says to deal with a single row at a time. In that case, `x.value` and `x.reading` are simply scalar values, so `math.log` will work. You could wrap your custom functions with `np.vectorize` and use them with `assign` if you felt like it. – mgilson Jan 01 '20 at 16:10

1 Answer

You should use vectorized functions as much as you can and reserve apply(..., axis=1) as a last resort, for when you really have to do things row by row.

Your problem can be solved with np.log, which is vectorized:

import numpy as np
df.assign(log=lambda x: np.log(x['value'] / x['reading']))
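
Since the question is about using the result only as a temporary sort key, one option is to chain assign, sort_values and drop. This is a minimal sketch rather than the only way to do it: the data is typed in from the frame printed in the question, and the column name `_log` is just a placeholder chosen for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the frame printed in the question
df = pd.DataFrame({
    'value':   [8, 5, 3, 8, 6, 4, 6, 3, 2, 8],
    'reading': [6, 9, 7, 2, 1, 9, 2, 5, 2, 8],
})

# Create the temporary column, sort by it, then drop it again.
# assign() returns a new frame, so df itself is left untouched.
sorted_df = (
    df.assign(_log=lambda x: np.log(x['value'] / x['reading']))
      .sort_values('_log')
      .drop(columns='_log')
)
print(sorted_df)
```

Because every step returns a new DataFrame, the helper column never appears in df or in the final result.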

If you have a custom function, it is better to rewrite it using vectorized functions from numpy or scipy. If that is not possible, you can fall back on np.vectorize:

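import math
import numpy as np

# A scalar-only function: math.log accepts a single number, not a Series
def my_custom_func(x):
    return math.log(x)

# np.vectorize wraps it so that it is called once per element
f = np.vectorize(my_custom_func)
df.assign(log2=lambda x: f(x['value'] / x['reading']))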
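
To make the difference concrete, here is a small sketch (the three ratios are taken from the first rows of the question's data; the variable names are only illustrative). It checks that math.log rejects a whole Series, while the np.vectorize wrapper and np.log agree on the result:

```python
import math

import numpy as np
import pandas as pd

ratio = pd.Series([8 / 6, 5 / 9, 6 / 1])

# math.log needs a single number, so a multi-element Series raises TypeError
try:
    math.log(ratio)
    raised = False
except TypeError:
    raised = True

# np.vectorize calls math.log once per element; np.log handles the Series natively
looped = np.vectorize(math.log)(ratio)
native = np.log(ratio)

print(raised, np.allclose(looped, native))
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so it should not be expected to run as fast as np.log.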
Code Different
  • Thanks for your answer. I also tried some other basic operations such as * and have the same issue. Should I define a vectorized product function, or is there a better way? For example: `df.assign(prodc = lambda x: 1 if (x.value * 2 > 4) else 0)` – Mehdi Zare Jan 01 '20 at 16:40
  • The `x` in the lambda is a DataFrame, so `x['value']` is a Series. You have to use **vectorized functions** to deal with them: `df.assign(prodc = lambda x: np.where(x['value'] * 2 > 4, 1, 0))` – Code Different Jan 01 '20 at 16:45
  • Thanks! @Code Different – Mehdi Zare Jan 01 '20 at 16:53
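
The conditional case from the last comments can be sketched as follows. The four-row frame is made up for illustration; the `astype(int)` variant is an alternative not mentioned in the thread, shown here only as an assumption about what may read more simply for a plain 0/1 flag.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [8, 5, 3, 2]})

# `1 if cond else 0` needs a single True/False, but cond is a whole Series,
# so pandas raises ValueError (the truth value of a Series is ambiguous)
try:
    df.assign(prodc=lambda x: 1 if (x['value'] * 2 > 4) else 0)
    ambiguous = False
except ValueError:
    ambiguous = True

# np.where evaluates the condition element by element
flagged = df.assign(prodc=lambda x: np.where(x['value'] * 2 > 4, 1, 0))

# For a plain 0/1 flag, casting the boolean mask gives the same values
also_flagged = df.assign(prodc=lambda x: (x['value'] * 2 > 4).astype(int))

print(flagged)
```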