
If I have a DataFrame with samples as columns and categories A–H as rows:

      Samp1   Samp2    Samp3    Samp4   Samp5                                                          
A  17.80000  56.000   0.2380  24.0000  31.700
B  29.80000  21.000  21.5000  59.7000  43.100
C   2.09000   7.450   0.0000   1.8800   4.670
D   0.00228   6.850   0.0159   4.6300  10.100
E  19.40000   2.860  11.1000   0.8160   1.220
F  10.60000   4.390  30.9000   4.5600   7.730
G   9.87000   0.501   4.2100   0.8100   0.176
H   0.10000   0.106  10.6000   1.1700   0.104

How can I collapse cell counts below a threshold (e.g. 10) into an "X" category, i.e. set each under-threshold cell to zero and add an "X" row holding the per-column sum of all under-threshold cells, like:

      Samp1   Samp2    Samp3    Samp4   Samp5
A  17.80000  56.000   0.0000  24.0000  31.700
B  29.80000  21.000  21.5000  59.7000  43.100
C   0.00000   0.000   0.0000   0.0000   0.000
D   0.00000   0.000   0.0000   0.0000  10.100
E  19.40000   0.000  11.1000   0.0000   0.000
F  10.60000   0.000  30.9000   0.0000   0.000
G   0.00000   0.000   0.0000   0.0000   0.000
H   0.00000   0.000  10.6000   0.0000   0.000
X  12.06228  22.157   4.4639  13.8660  13.900  <--

Many thanks; glad to clarify if this is not clear.

Andrea T.

1 Answer


Use DataFrame.gt to build a boolean mask, replace the non-matching values with 0 via DataFrame.where, and then add a row X holding the per-column sum of the replaced values, obtained with DataFrame.mask and sum:

th = 10
m = df.gt(th)                    # True where a cell is above the threshold
df1 = df.where(m, 0)             # zero out the under-threshold cells
df1.loc['X'] = df.mask(m).sum()  # per-column sum of the collapsed cells
print(df1)
      Samp1   Samp2    Samp3   Samp4  Samp5
A  17.80000  56.000   0.0000  24.000   31.7
B  29.80000  21.000  21.5000  59.700   43.1
C   0.00000   0.000   0.0000   0.000    0.0
D   0.00000   0.000   0.0000   0.000   10.1
E  19.40000   0.000  11.1000   0.000    0.0
F  10.60000   0.000  30.9000   0.000    0.0
G   0.00000   0.000   0.0000   0.000    0.0
H   0.00000   0.000  10.6000   0.000    0.0
X  12.06228  22.157   4.4639  13.866   13.9
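For a quick end-to-end check, the frame from the question can be rebuilt and the snippet run as-is (a minimal self-contained sketch; the data is copied from the question):

```python
import pandas as pd

# Rebuild the frame from the question.
data = {
    "Samp1": [17.8, 29.8, 2.09, 0.00228, 19.4, 10.6, 9.87, 0.1],
    "Samp2": [56.0, 21.0, 7.45, 6.85, 2.86, 4.39, 0.501, 0.106],
    "Samp3": [0.238, 21.5, 0.0, 0.0159, 11.1, 30.9, 4.21, 10.6],
    "Samp4": [24.0, 59.7, 1.88, 4.63, 0.816, 4.56, 0.81, 1.17],
    "Samp5": [31.7, 43.1, 4.67, 10.1, 1.22, 7.73, 0.176, 0.104],
}
df = pd.DataFrame(data, index=list("ABCDEFGH"))

th = 10
m = df.gt(th)                    # True where a cell survives
df1 = df.where(m, 0)             # zero out the under-threshold cells
df1.loc["X"] = df.mask(m).sum()  # per-column sum of the collapsed cells
print(df1)
```

Note that the X row preserves the column totals: `df1.sum()` equals `df.sum()` per column, which is an easy sanity check.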

EDIT:

import numpy as np

top = 3
# double argsort yields each cell's descending rank within its column
# https://github.com/numpy/numpy/issues/8757
m = np.argsort(np.argsort(-df, axis=0), axis=0) < top

df1 = df.where(m, 0)             # zero everything outside the top N
df1.loc['X'] = df.mask(m).sum()  # per-column sum of the collapsed cells
print(df1)
      Samp1   Samp2    Samp3   Samp4  Samp5
A  17.80000  56.000   0.0000  24.000   31.7
B  29.80000  21.000  21.5000  59.700   43.1
C   0.00000   7.450   0.0000   0.000    0.0
D   0.00000   0.000   0.0000   4.630   10.1
E  19.40000   0.000  11.1000   0.000    0.0
F   0.00000   0.000  30.9000   0.000    0.0
G   0.00000   0.000   0.0000   0.000    0.0
H   0.00000   0.000   0.0000   0.000    0.0
X  22.66228  14.707  15.0639   9.236   13.9
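As an alternative sketch, the same per-column top-N mask can be built with pandas alone via DataFrame.rank, assuming ties should be broken by position (which `method="first"` does, so each column keeps exactly `top` cells):

```python
import pandas as pd

# Same frame as in the question.
data = {
    "Samp1": [17.8, 29.8, 2.09, 0.00228, 19.4, 10.6, 9.87, 0.1],
    "Samp2": [56.0, 21.0, 7.45, 6.85, 2.86, 4.39, 0.501, 0.106],
    "Samp3": [0.238, 21.5, 0.0, 0.0159, 11.1, 30.9, 4.21, 10.6],
    "Samp4": [24.0, 59.7, 1.88, 4.63, 0.816, 4.56, 0.81, 1.17],
    "Samp5": [31.7, 43.1, 4.67, 10.1, 1.22, 7.73, 0.176, 0.104],
}
df = pd.DataFrame(data, index=list("ABCDEFGH"))

top = 3
# Rank each cell within its column in descending order;
# keep only ranks 1..top.
m = df.rank(ascending=False, method="first") <= top

df1 = df.where(m, 0)             # zero everything outside the top N
df1.loc["X"] = df.mask(m).sum()  # per-column sum of the collapsed cells
print(df1)
```

This produces the same result as the double-argsort version above while staying inside the pandas API.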
jezrael
  • Brilliant! If instead of a fixed threshold, I'd like to keep for each sample the top N values (e.g. 3) and do the same collapsing, how would you do it? – Andrea T. Oct 15 '20 at 09:15
  • Not super important, but the DF can have ~50-100 samples and > 1000 categories. – Andrea T. Oct 15 '20 at 09:25
  • @AndreaT. - Answer is edited, here is used [argsort in descending order](https://stackoverflow.com/a/16486305/2901002) – jezrael Oct 15 '20 at 09:38
  • I just noted that it's not working as expected; I probably wasn't clear here. Samp1's top three values are 29, 19 and 17, but it's keeping 17, 10 and 2 in the example here. – Andrea T. Oct 15 '20 at 10:50
  • @AndreaT. - You are right, answer is edited also with link. – jezrael Oct 15 '20 at 11:25