Pandas DataFrame change a value based on column, index values comparison

Question

Suppose that you have a pandas DataFrame which has some kind of data in the body and numbers in the column and index names.

>>> data=np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
>>> columns = [2, 4, 8]
>>> index = [10, 4, 2]
>>> df = pd.DataFrame(data, columns=columns, index=index)
>>> df
    2  4  8
10  a  b  c
4   d  e  f
2   g  h  i

Now suppose we want to manipulate are data frame in some kind of way based on comparing the index and columns. Consider the following.

Where index is greater than column replace letter with 'k':

    2  4  8
10  k  k  k
4   k  e  f
2   g  h  i

Where index is equal to column replace letter with 'U':

    2  4  8
10  k  k  k
4   k  U  f
2   U  h  i

Where column is greater than index replace letter with 'Y':

    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

To keep the question useful to all:

What is a fast way to do this replacement?
What is the simplest way to do this replacement?

Speed Results from minimal example

jezrael: 556 µs ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
user3471881: 329 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
thunderwood: 4.65 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Is this a duplicate? I searched google for pandas replace compare index column and the top results are:

Pandas - Compare two dataframes and replace values matching condition

Python pandas: replace values based on location not index value

Pandas DataFrame: replace all values in a column, based on condition

However, I don't feel any of these touch on whether this a) possible or b) how to compare in such a way

Maybe I'm stupid but your expected output doesn't match your conditions to me. For example: why don't you add a `k` to all rows where index is 10 and why is index 2 greater than column 2? And why does all rows for index 2 get the value `Y`? — user3471881, Nov 02 '18 at 13:02
@user3471881 totally correct, think I got it fixed. They looked right when I looked it over... completely wrong though. Thanks +1 — akozi, Nov 02 '18 at 13:06

jezrael · Accepted Answer · 2018-11-02T14:40:44.787

9

I think you need numpy.select with broadcasting:

m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values


df = pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
print (df)
    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

Performance:

np.random.seed(1000)

N = 1000
a = np.random.randint(100, size=N)
b = np.random.randint(100, size=N)

df = pd.DataFrame(np.random.choice(list('abcdefgh'), size=(N, N)), columns=a, index=b)
#print (df)

def us(df):
    values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')
    greater = values > 0
    less = values < 0
    same = values == 0

    values[greater] = 'k'
    values[less] = 'Y'
    values[same] = 'U'


    return pd.DataFrame(values, columns=df.columns, index=df.index)

def jez(df):

    m1 = df.index.values[:, None] > df.columns.values
    m2 = df.index.values[:, None] == df.columns.values
    return pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)

In [236]: %timeit us(df)
107 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [237]: %timeit jez(df)
64 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

edited Nov 02 '18 at 14:40

answered Nov 02 '18 at 13:03

jezrael

822,522
95
1,334
1,252

1

This crashes for me if `columns` and `rows` are of different length. – user3471881 Nov 02 '18 at 14:18
1

@user3471881 - Thank you, added solution - casting to numpy arrays by `.values`. Also it should be better performance in bigger DataFrame. – jezrael Nov 02 '18 at 14:26
Would this method work if some of the original data was to be maintained? It seems like it is using the fact that all the points are changed here. – akozi Nov 02 '18 at 14:36
1

I'll give you the checkmark back for the same reason I gave it to user :) – akozi Nov 02 '18 at 14:41
@akozi - Can you explain more? – jezrael Nov 02 '18 at 14:41
1

When you use np.select all of the points not covered by `m1` and `m2` are filled by the value 'Y'. If the final filter replacing to 'Y' was not used could you show an example of how you would use your answer to also output the value 'f', 'h', and 'i'? – akozi Nov 02 '18 at 14:47
@akozi - then add msks like `m3 = df.index.values[:, None] == df.columns.values * 2` and `df = pd.DataFrame(np.select([m1, m2, m3], ['k','U', 'f'], 'Y'), columns=df.columns, index=df.index)` – jezrael Nov 02 '18 at 14:49
@akozi I think I was wondering the same thing. I think you can just leave the `default` value as the `df` and it will maintain those values in the array if the masks don't change them. `df = pd.DataFrame(np.select([m1, m2], ['k','U', 'f'], df), columns=df.columns, index=df.index)` – bkeesey Feb 21 '22 at 20:37

score 2 · Answer 2 · answered Nov 02 '18 at 13:09

2

Not sure about the fastest way to accomplish this but an incredibly simple way would be to just iterate over the dataframe like such:

for i in df.index:
    for j in df.columns:
        if i>j:
            df.loc[i,j]='k'
        elif j>i:
            df.loc[i,j]='y'
        else:
            df.loc[i,j]='u'

answered Nov 02 '18 at 13:09

Thunderwood

525
2
9

3

This is ~8 times slower than using `numpy` as demonstrated by @jezrael. – user3471881 Nov 02 '18 at 13:16
2

true, but the question asked for both the fastest, and the simplest way of doing things. This is so simple a total beginner that just picked up python this week could understand it. – Thunderwood Nov 02 '18 at 13:17
I didn't mean it as a negative comment, just added it cuz OP asked. – user3471881 Nov 02 '18 at 13:17
ah okay understood – Thunderwood Nov 02 '18 at 13:18

user3471881 · Answer 3 · 2018-11-02T15:11:26.613

1

1. Using np.arrays + np.select:

values = np.array(np.array([df.index]).transpose() - np.array([df.columns]))

greater = values > 0
same = values == 0

df = pd.DataFrame(np.select([greater, same], ['k', 'U'], 'Y'), columns=df.columns, index=df.index)

2. Using np.arrays and manual masking.

values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')

greater = values > 0
less = values < 0
same = values == 0

values[greater] = 'k'
values[less] = 'Y'
values[same] = 'U'


df = pd.DataFrame(values, columns=df.columns, index=df.index)

edited Nov 02 '18 at 15:11

answered Nov 02 '18 at 13:59

user3471881

2,614
3
18
34

Thanks for the answer. What package are you using for speed? I was going to add different speeds to the OP, and I quite like the output yours gives. – akozi Nov 02 '18 at 14:15
I think I will give this the checkmark for now since it is faster and roughly the same complexity. Hard to quantify the complexity I don't think yours or Thunderwood's answer is any harder to read. – akozi Nov 02 '18 at 14:28
@akozi - What is size of your real DataFrame? What is performance in your real data? Because testing in small data sample should be different like in bigger df. – jezrael Nov 02 '18 at 14:30
Would this work if not all of the points were changed from their original. Personally, my data sets range around 500x500. Still small enough that the differences in speeds between the methods are not very noticeable. – akozi Nov 02 '18 at 14:38

Pandas DataFrame change a value based on column, index values comparison

3 Answers3

Linked