6

Suppose that you have a pandas DataFrame which has some kind of data in the body and numbers in the column and index names.

>>> data=np.array([['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']])
>>> columns = [2, 4, 8]
>>> index = [10, 4, 2]
>>> df = pd.DataFrame(data, columns=columns, index=index)
>>> df
    2  4  8
10  a  b  c
4   d  e  f
2   g  h  i

Now suppose we want to manipulate are data frame in some kind of way based on comparing the index and columns. Consider the following.

Where index is greater than column replace letter with 'k':

    2  4  8
10  k  k  k
4   k  e  f
2   g  h  i

Where index is equal to column replace letter with 'U':

    2  4  8
10  k  k  k
4   k  U  f
2   U  h  i

Where column is greater than index replace letter with 'Y':

    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

To keep the question useful to all:

  • What is a fast way to do this replacement?

  • What is the simplest way to do this replacement?

Speed Results from minimal example

  • jezrael: 556 µs ± 66.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  • user3471881: 329 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

  • thunderwood: 4.65 ms ± 252 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Is this a duplicate? I searched google for pandas replace compare index column and the top results are:

Pandas - Compare two dataframes and replace values matching condition

Python pandas: replace values based on location not index value

Pandas DataFrame: replace all values in a column, based on condition

However, I don't feel any of these touch on whether this a) possible or b) how to compare in such a way

akozi
  • 425
  • 1
  • 7
  • 20
  • 2
    Maybe I'm stupid but your expected output doesn't match your conditions to me. For example: why don't you add a `k` to all rows where index is 10 and why is index 2 greater than column 2? And why does all rows for index 2 get the value `Y`? – user3471881 Nov 02 '18 at 13:02
  • 2
    @user3471881 totally correct, think I got it fixed. They looked right when I looked it over... completely wrong though. Thanks +1 – akozi Nov 02 '18 at 13:06
  • 1
    I updated my code a lil bit so should be quicker now. – user3471881 Nov 02 '18 at 14:15

3 Answers3

9

I think you need numpy.select with broadcasting:

m1 = df.index.values[:, None] > df.columns.values
m2 = df.index.values[:, None] == df.columns.values


df = pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)
print (df)
    2  4  8
10  k  k  k
4   k  U  Y
2   U  Y  Y

Performance:

np.random.seed(1000)

N = 1000
a = np.random.randint(100, size=N)
b = np.random.randint(100, size=N)

df = pd.DataFrame(np.random.choice(list('abcdefgh'), size=(N, N)), columns=a, index=b)
#print (df)

def us(df):
    values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')
    greater = values > 0
    less = values < 0
    same = values == 0

    values[greater] = 'k'
    values[less] = 'Y'
    values[same] = 'U'


    return pd.DataFrame(values, columns=df.columns, index=df.index)

def jez(df):

    m1 = df.index.values[:, None] > df.columns.values
    m2 = df.index.values[:, None] == df.columns.values
    return pd.DataFrame(np.select([m1, m2], ['k','U'], 'Y'), columns=df.columns, index=df.index)

In [236]: %timeit us(df)
107 ms ± 358 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [237]: %timeit jez(df)
64 ms ± 299 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    This crashes for me if `columns` and `rows` are of different length. – user3471881 Nov 02 '18 at 14:18
  • 1
    @user3471881 - Thank you, added solution - casting to numpy arrays by `.values`. Also it should be better performance in bigger DataFrame. – jezrael Nov 02 '18 at 14:26
  • Would this method work if some of the original data was to be maintained? It seems like it is using the fact that all the points are changed here. – akozi Nov 02 '18 at 14:36
  • 1
    I'll give you the checkmark back for the same reason I gave it to user :) – akozi Nov 02 '18 at 14:41
  • @akozi - Can you explain more? – jezrael Nov 02 '18 at 14:41
  • 1
    When you use np.select all of the points not covered by `m1` and `m2` are filled by the value 'Y'. If the final filter replacing to 'Y' was not used could you show an example of how you would use your answer to also output the value 'f', 'h', and 'i'? – akozi Nov 02 '18 at 14:47
  • @akozi - then add msks like `m3 = df.index.values[:, None] == df.columns.values * 2` and `df = pd.DataFrame(np.select([m1, m2, m3], ['k','U', 'f'], 'Y'), columns=df.columns, index=df.index)` – jezrael Nov 02 '18 at 14:49
  • @akozi I think I was wondering the same thing. I think you can just leave the `default` value as the `df` and it will maintain those values in the array if the masks don't change them. `df = pd.DataFrame(np.select([m1, m2], ['k','U', 'f'], df), columns=df.columns, index=df.index)` – bkeesey Feb 21 '22 at 20:37
2

Not sure about the fastest way to accomplish this but an incredibly simple way would be to just iterate over the dataframe like such:

for i in df.index:
    for j in df.columns:
        if i>j:
            df.loc[i,j]='k'
        elif j>i:
            df.loc[i,j]='y'
        else:
            df.loc[i,j]='u'
Thunderwood
  • 525
  • 2
  • 9
1

1. Using np.arrays + np.select:

values = np.array(np.array([df.index]).transpose() - np.array([df.columns]))

greater = values > 0
same = values == 0

df = pd.DataFrame(np.select([greater, same], ['k', 'U'], 'Y'), columns=df.columns, index=df.index)

2. Using np.arrays and manual masking.

values = np.array(np.array([df.index]).transpose() - np.array([df.columns]), dtype='object')

greater = values > 0
less = values < 0
same = values == 0

values[greater] = 'k'
values[less] = 'Y'
values[same] = 'U'


df = pd.DataFrame(values, columns=df.columns, index=df.index)
user3471881
  • 2,614
  • 3
  • 18
  • 34
  • Thanks for the answer. What package are you using for speed? I was going to add different speeds to the OP, and I quite like the output yours gives. – akozi Nov 02 '18 at 14:15
  • I think I will give this the checkmark for now since it is faster and roughly the same complexity. Hard to quantify the complexity I don't think yours or Thunderwood's answer is any harder to read. – akozi Nov 02 '18 at 14:28
  • @akozi - What is size of your real DataFrame? What is performance in your real data? Because testing in small data sample should be different like in bigger df. – jezrael Nov 02 '18 at 14:30
  • Would this work if not all of the points were changed from their original. Personally, my data sets range around 500x500. Still small enough that the differences in speeds between the methods are not very noticeable. – akozi Nov 02 '18 at 14:38