0

I have some challenge in my work regarding working with pandas.

I have a pandas dataframe with columns Time, x, y, a, b.

For simplicity, the dataframe has 5 records, from which 3 records are fully filled (Time, x,y,a,b has data). Other 2 a,b are empty. Time is unidirectional

I'd like to perform a calculation on some condition on Time (lets say Time > 3) and store result on a and b (lets say the functions are a=x^2, b=x^3). The calculation of a,b shall be performed in single function (I'm using lambda function). For example

Time    x   y       a        b
0.3     0   1     2.0      3.0
1.5     4   5     6.0      7.0
2.8     8   9    10.0     11.0
3.3     8  13    None     None
4.5     3  17    None     None

Shall be converted to

Time    x   y       a        b
0.3     0   1     2.0      3.0
1.5     4   5     6.0      7.0
2.8     8   9    10.0     11.0
3.3     8  13    64.0    512.0
4.5     3  17     9.0     27.0

Any assistance would be appreciated

Notes:

  • number of records here are for simplicity and the code shall be general for every number of records.
  • need that the code will be optimized for performance.
user1977050
  • 496
  • 1
  • 6
  • 18

3 Answers3

3

Sure - you can use .loc with a suitable filter to select the rows and columns to work on, e.g.:

df.loc[df['x'] > 10, ['a', 'b']] = (
    df.loc[df['x'] > 10, 'x'].map(lambda x: (x ** 2, x ** 3))
)

Note how returning a tuple from the lambda and having ['a', 'b'] in the assignment side lets you do both calculations at once.

AKX
  • 152,115
  • 15
  • 115
  • 172
  • My data, functions are more complicate, but this shall be more than enough for me in order to continue – user1977050 Aug 27 '23 at 15:07
  • Which pandas version did you use? `map` doesn't allow unpacking in the latest version. Also constructing a Series of tuples with apply/map is not vectorized. – mozway Aug 27 '23 at 16:02
  • @mozway pandas version 1.3.5 – user1977050 Aug 28 '23 at 05:11
  • @user1977050 my question was originally for AKX. That said, this is a very old version, you should consider updating ;) Keep in mind that this solution is both inefficient and doesn't work on a recent pandas. You said in the question that you needed an efficient approach, `map` is not. – mozway Aug 28 '23 at 05:19
0

Use simple boolean indexing and vectorial code:

m = df['Time'].gt(3)
x = df.loc[m, 'x']

df.loc[m, 'a'] = x**2
df.loc[m, 'b'] = x**3

Output:

   Time  x   y     a      b
0   0.3  0   1   2.0    3.0
1   1.5  4   5   6.0    7.0
2   2.8  8   9  10.0   11.0
3   3.3  8  13  64.0  512.0
4   4.5  3  17   9.0   27.0

Comparison of efficiency with a map (without assignment as this raises a TypeError anyway on pandas 2.0.1):

enter image description here

mozway
  • 194,879
  • 13
  • 39
  • 75
  • The example is simplified. In my case, the values of `a,b` are calculated based on some other values of the data frame, which are not related to x and may be based on object values (i.e. if `c='LOSS' then b='TEST'`) – user1977050 Aug 28 '23 at 10:22
  • @user1977050 can you provide an example? There are pandas methods to do that efficiently (i.e without a loop/map/apply) – mozway Aug 28 '23 at 11:07
  • To be more specific: data frame contains time stamp. I'd like to find all records where timestamp is greater than X. On those records, apply function to some columns which will give a result to other column(s). (In the example above, x is for time stamp) – user1977050 Aug 28 '23 at 11:14
  • Can you [edit](https://stackoverflow.com/posts/76987365/edit) your question with a reproducible example? – mozway Aug 28 '23 at 11:23
  • Question has been updated – user1977050 Aug 28 '23 at 11:26
  • But you didn't provide a [reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – mozway Aug 28 '23 at 11:28
  • Try now :) :) :) – user1977050 Aug 28 '23 at 11:38
  • @user1977050 I updated the code, see it's almost the same, just the condition changes. You can use a condition on multiple columns if needed. No need for `map` or `apply` that are slow. – mozway Aug 28 '23 at 11:47
  • im guessing my answer probably slower than yours, but just thought i would throw another solution in the mix :) – rhug123 Aug 28 '23 at 19:03
-1

Here is a way:

df.fillna(df[['x','x']].pow([2,3]).set_axis(list('ab'),axis=1).where(df['Time'].gt(3)))

Output:

   Time  x   y     a      b
0   0.3  0   1   2.0    3.0
1   1.5  4   5   6.0    7.0
2   2.8  8   9  10.0   11.0
3   3.3  8  13  64.0  512.0
4   4.5  3  17   9.0   27.0
rhug123
  • 7,893
  • 1
  • 9
  • 24