
What's the most effective way to solve the following pandas problem?

Here's a simplified example with some data in a data frame:

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,size=(10, 4)), columns=['a','b','c','d'], 
                  index=np.random.randint(0,10,size=10))

This data looks like this:

   a  b  c  d
1  0  0  9  9
0  2  2  1  7
3  9  3  4  0
2  5  0  9  4
1  7  7  7  2
6  4  4  6  4
1  1  6  0  0
7  8  0  9  3
5  0  0  8  3
4  5  0  2  4

Now I want to apply some function f to each value in the data frame (the function below, for example) and get a data frame back as output. The tricky part is that the function I'm applying depends on the value of the index of the row I am currently at.

def f(cell_val, row_val):
    """some function which needs to know row_val to use it"""
    try:
        return cell_val/row_val
    except ZeroDivisionError:
        return -1

Normally, if I wanted to apply a function to each individual cell in the data frame, I would just call .applymap() with f. Even if I had to pass in a second argument ('row_val', in this case), if the argument were a fixed number I could just write a lambda expression such as lambda x: f(x, i), where i is the fixed number I wanted. However, my second argument varies depending on the row of the data frame I am currently calling the function from, which means I can't just use .applymap().

How would I go about solving a problem like this efficiently? I can think of a few ways to do this, but none of them feel "right". I could:

  • loop through each individual value and replace them one by one, but that seems really awkward and slow.
  • create a completely separate data frame containing (cell value, row value) tuples and use the builtin pandas applymap() on my tuple data frame. But that seems pretty hacky and I'm also creating a completely separate data frame as an extra step.
  • there must be a better solution to this (a fast solution would be appreciated, because my data frame could get very large).
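To make the contrast concrete, here is a small runnable sketch of the two situations described above, reusing the question's f and a smaller df. The cast to plain Python ints is my addition: numpy integer scalars return inf on division by zero instead of raising ZeroDivisionError, so without the cast f's except branch would never fire.

```python
import numpy as np
import pandas as pd

def f(cell_val, row_val):
    """The question's function: divide, returning -1 on division by zero."""
    try:
        return cell_val / row_val
    except ZeroDivisionError:
        return -1

df = pd.DataFrame(np.arange(8).reshape(4, 2), columns=['a', 'b'],
                  index=[1, 0, 3, 2])

# With a *fixed* second argument, an elementwise map is enough:
i = 2
fixed = df.apply(lambda col: col.map(lambda x: f(int(x), i)))

# With a *row-dependent* second argument, apply over rows instead:
# row.name is the index label of the row being processed.
varying = df.apply(lambda row: row.map(lambda x: f(int(x), int(row.name))),
                   axis=1)
```

The row-wise apply is still a Python-level loop over rows, so it is only a readability win, not the vectorized fast path the question asks for.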
K. Mao
    Sorry are you after: `df.div(df.index.to_series(), axis=0)`? – EdChum Sep 29 '16 at 15:05
  • To be clear, you don't just want to access the individual row-index values, you ideally want to **access all the row-index as a series or array, so you can use vectorized operations**. (Then you mightn't even need to declare a lambda or function `f`) In your particular case, your row-indices are all integers, but in the general case, they might be strings, categoricals, dates, datetimes etc. – smci Mar 19 '22 at 22:28

2 Answers


IIUC you can use div with axis=0; you also need to convert the Index object to a Series using to_series:

In [121]:
df.div(df.index.to_series(), axis=0).replace(np.inf, -1)

Out[121]:
          a         b         c         d
1  0.000000  0.000000  9.000000  9.000000
0 -1.000000 -1.000000 -1.000000 -1.000000
3  3.000000  1.000000  1.333333  0.000000
2  2.500000  0.000000  4.500000  2.000000
1  7.000000  7.000000  7.000000  2.000000
6  0.666667  0.666667  1.000000  0.666667
1  1.000000  6.000000  0.000000  0.000000
7  1.142857  0.000000  1.285714  0.428571
5  0.000000  0.000000  1.600000  0.600000
4  1.250000  0.000000  0.500000  1.000000

Additionally, since division by zero results in inf, you need to call replace to map those values to -1, as f would. (Note that 0/0 produces NaN rather than inf, so a more robust call is .replace([np.inf, np.nan], -1).)

EdChum
  • This works in the example case, but what if I had a more complicated function than simple division that could possibly fail with an error at some point? Then I wouldn't be able to just call pandas.div on my data frame. – K. Mao Sep 29 '16 at 15:12
  • You'll need to explain how this would fail as this handles `0` division – EdChum Sep 29 '16 at 15:13
  • For example, say instead of performing division my function did a lookup in another data frame and I needed to replace IndexErrors with something else. Something like "def f(x,y): try: return df2.iloc[x,y] except IndexError: return -1" – K. Mao Sep 29 '16 at 15:16
  • For that case you can just test for the intersection of the indices and return `-1` where the indices differ, e.g. `common = df1.index.intersection(df2.index)`; then you can use the common row values fine and return `-1` for all the rest. Also, what you're asking is fundamentally different from your question, so you should ask another question – EdChum Sep 29 '16 at 15:20
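For the arbitrary-failure case discussed in these comments, the row-wise apply pattern still works, since row.name exposes the index label to any function regardless of how it can fail. A hypothetical sketch (df2 and its shape are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 1], [2, 3]], columns=['a', 'b'], index=[1, 5])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3))  # hypothetical lookup table

def f(x, y):
    """The commenter's example: a lookup that can raise IndexError."""
    try:
        return df2.iloc[x, y]
    except IndexError:
        return -1

# row.name is the row's index label, so any cell-plus-row-label
# function can be applied this way, whatever its failure mode.
out = df.apply(lambda row: row.map(lambda x: f(x, row.name)), axis=1)
```

This trades the vectorized speed of the accepted div answer for generality: f can contain any try/except logic.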

Here's how you can add the index to the dataframe via broadcasting:

pd.DataFrame(df.values + df.index.values[:, None], df.index, df.columns)
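The same broadcasting trick can be adapted to the question's division problem: the (n, 1)-shaped index column broadcasts across the columns of df.values. This is my adaptation, not part of the original answer; the -1 masking mimics f's ZeroDivisionError branch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 9], [2, 1], [9, 4]],
                  columns=['a', 'b'], index=[1, 0, 3])

# Divide every row by its index label via broadcasting.
with np.errstate(divide='ignore', invalid='ignore'):
    vals = df.values / df.index.values[:, None]

# Rows whose index label is 0 would have hit ZeroDivisionError in f,
# so overwrite those rows with -1.
vals[df.index.values == 0] = -1
out = pd.DataFrame(vals, df.index, df.columns)
```

This stays fully vectorized, so it should scale to the large frames the question worries about.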
piRSquared
  • Use `df.div(df.index.array, axis=0)`. Don't use `.values`, it's [planned to be deprecated](https://pandas.pydata.org/docs/whatsnew/v0.24.0.html#accessing-the-values-in-a-series-or-index). Use `.array`. – smci Mar 19 '22 at 22:27