2

I have a Pandas Dataframe question. I have a df with index=column. It looks like below.

df:
       DNA      Cat2                                   
       Item     A   B   C   D   E   F   F   H   I   J  .......
DNA   Item
Cat2  A         812 62  174 0   4   46  46  7   2   15 
      B         62  427 27  0   0   12  61  2   4   11 
      C         174 27  174 0   0   13  22  5   2   4  
      D         0   0   0   0   0   0   0   0   0   0  
      E         4   0   0   0   130 10  57  33  4   5  
      F         46  12  13  0   10  187 4   5   0   0 
      ......

Another words, df=df.transpose(). All I want to do is find pandas (or numpy for df.values())function to delete index=column values. My ideal output would be below.

df:
       DNA      Cat2                                   
       Item     A   B   C   D   E   F   F   H   I   J  .......
DNA   Item
Cat2  A         0   62  174 0   4   46  46  7   2   15 
      B         62  0   27  0   0   12  61  2   4   11 
      C         174 27  0   0   0   13  22  5   2   4  
      D         0   0   0   0   0   0   0   0   0   0  
      E         4   0   0   0   0   10  57  33  4   5  
      F         46  12  13  0   10  0   4   5   0   0 
      ......

Is there a python function that makes this step very fast? I tried for loop with df.iloc[i,i]=0 but since my dataset is ver big, it takes long time to finish. Thanks in advance!

EJ Kang
  • 455
  • 2
  • 5
  • 17

1 Answers1

3

Setup

np.random.seed([3,1415])
i = pd.MultiIndex.from_product(
    [['Cat2'], list('ABCDEFGHIJ')],
    names=['DNA', 'Item']
)

a = np.random.randint(5, size=(10, 10))
df = pd.DataFrame(a + a.T + 1, i, i)

df

DNA       Cat2                           
Item         A  B  C  D  E  F  G  H  I  J
DNA  Item                                
Cat2 A       1  6  6  7  7  7  4  4  8  2
     B       6  1  3  6  1  6  6  4  8  5
     C       6  3  9  8  9  6  7  8  4  9
     D       7  6  8  1  6  9  4  5  4  3
     E       7  1  9  6  9  7  3  7  2  6
     F       7  6  6  9  7  9  3  4  6  6
     G       4  6  7  4  3  3  9  4  5  5
     H       4  4  8  5  7  4  4  5  4  5
     I       8  8  4  4  2  6  5  4  9  7
     J       2  5  9  3  6  6  5  5  7  3

Option 1
Simplest way is to multiply by 1 less the identity

df * (1 - np.eye(len(df), dtype=int))

DNA       Cat2                           
Item         A  B  C  D  E  F  G  H  I  J
DNA  Item                                
Cat2 A       0  6  6  7  7  7  4  4  8  2
     B       6  0  3  6  1  6  6  4  8  5
     C       6  3  0  8  9  6  7  8  4  9
     D       7  6  8  0  6  9  4  5  4  3
     E       7  1  9  6  0  7  3  7  2  6
     F       7  6  6  9  7  0  3  4  6  6
     G       4  6  7  4  3  3  0  4  5  5
     H       4  4  8  5  7  4  4  0  4  5
     I       8  8  4  4  2  6  5  4  0  7
     J       2  5  9  3  6  6  5  5  7  0

Option 2
However, we can also use pd.DataFrame.mask with np.eye. Masking is nice because it doesn't have to be numeric and it will still work.

df.mask(np.eye(len(df), dtype=bool), 0)

DNA       Cat2                           
Item         A  B  C  D  E  F  G  H  I  J
DNA  Item                                
Cat2 A       0  6  6  7  7  7  4  4  8  2
     B       6  0  3  6  1  6  6  4  8  5
     C       6  3  0  8  9  6  7  8  4  9
     D       7  6  8  0  6  9  4  5  4  3
     E       7  1  9  6  0  7  3  7  2  6
     F       7  6  6  9  7  0  3  4  6  6
     G       4  6  7  4  3  3  0  4  5  5
     H       4  4  8  5  7  4  4  0  4  5
     I       8  8  4  4  2  6  5  4  0  7
     J       2  5  9  3  6  6  5  5  7  0

Option 3
In the event the columns and indices are not identical, OR the are out of order. We can use equality to tell us where to mask.

d = df.iloc[::-1]
d.mask(d.index == d.columns.values[:, None], 0)

DNA       Cat2                           
Item         A  B  C  D  E  F  G  H  I  J
DNA  Item                                
Cat2 J       2  5  9  3  6  6  5  5  7  0
     I       8  8  4  4  2  6  5  4  0  7
     H       4  4  8  5  7  4  4  0  4  5
     G       4  6  7  4  3  3  0  4  5  5
     F       7  6  6  9  7  0  3  4  6  6
     E       7  1  9  6  0  7  3  7  2  6
     D       7  6  8  0  6  9  4  5  4  3
     C       6  3  0  8  9  6  7  8  4  9
     B       6  0  3  6  1  6  6  4  8  5
     A       0  6  6  7  7  7  4  4  8  2
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I understand all the diagnal values are multiply by zero, but would other values be effected by multiply? – EJ Kang Jan 11 '18 at 06:14
  • They get multiplied by 1 – piRSquared Jan 11 '18 at 06:15
  • Wow. thanks for all the options. How would you rank as speed wise? – EJ Kang Jan 11 '18 at 06:35
  • Not sure. None of these were intended to be super-duper fast. But there isn't anything really slowing them down. I suspect they are all similar. Let me know if you need faster. If so, I'll post a tailored solution for speed. – piRSquared Jan 11 '18 at 06:38