6

The original dataframe is a table like this:

                        S1_r1_ctrl/     S1_r2_ctrl/     S1_r3_ctrl/
sp|P38646|GRP75_HUMAN   2.960000e-06    5.680000e-06    0.000000e+00
sp|O75694-2|NU155_HUMAN 2.710000e-07    0.000000e+00    2.180000e-07
sp|Q05397-2|FAK1_HUMAN  0.000000e+00    2.380000e-07    7.330000e-06
sp|O60671-2|RAD1_HUMAN  NaN             NaN             NaN

I am looking for the smallest value greater than zero in each column of a dataframe. I was trying to use this example to answer my question. My code looks like:

df.ne(0).idxmin().to_frame('pos').assign(value=lambda d: df.lookup(d.pos, d.index))

but I still get only zeros, and my result looks like this:

            pos                     value

S1_r1_ctrl/ sp|Q05397-2|FAK1_HUMAN  0.0
S1_r2_ctrl/ sp|O75694-2|NU155_HUMAN 0.0
S1_r3_ctrl/ sp|P38646|GRP75_HUMAN   0.0

instead of this:

            pos                     value
S1_r1_ctrl/ sp|O75694-2|NU155_HUMAN 2.710000e-07
S1_r2_ctrl/ sp|Q05397-2|FAK1_HUMAN  2.380000e-07
S1_r3_ctrl/ sp|O75694-2|NU155_HUMAN 2.180000e-07

I guess there might be a problem with the data types, but I'm not sure. I assumed ne(0) would ignore zeros, but it doesn't, so I'm confused why. And perhaps there's a smarter way to find what I need.

plnnvkv
  • Please post dataframes as text, not images – user3483203 Jul 23 '18 at 15:03
  • Possible duplicate of [Extracting the n-th maximum/minimum value in Pandas DataFrame](https://stackoverflow.com/questions/48026397/extracting-the-n-th-maximum-minimum-value-in-pandas-dataframe) – Rushabh Mehta Jul 23 '18 at 15:47

5 Answers

8

Setup

df = pd.DataFrame([[0, 0, 0],
                   [0, 10, 0],
                   [4, 0, 0],
                   [1, 2, 3]],
                  columns=['first', 'second', 'third'])

Using a mask with min(0):

df[df.gt(0)].min(0)

first     1.0
second    2.0
third     3.0
dtype: float64

As @DSM pointed out, this can also be written:

df.where(df.gt(0)).min(0)
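The asker also wanted the row label (`pos`) of each column minimum, not just the value. A minimal sketch of how the same mask recovers both, using a frame reconstructed from the question's table:

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question's table (values copied from it)
df = pd.DataFrame(
    {'S1_r1_ctrl/': [2.96e-06, 2.71e-07, 0.0, np.nan],
     'S1_r2_ctrl/': [5.68e-06, 0.0, 2.38e-07, np.nan],
     'S1_r3_ctrl/': [0.0, 2.18e-07, 7.33e-06, np.nan]},
    index=['sp|P38646|GRP75_HUMAN', 'sp|O75694-2|NU155_HUMAN',
           'sp|Q05397-2|FAK1_HUMAN', 'sp|O60671-2|RAD1_HUMAN'])

masked = df.where(df.gt(0))   # zeros become NaN, so min/idxmin skip them
result = masked.idxmin().to_frame('pos').assign(value=masked.min())
```

This yields the `pos`/`value` table the asker expected, without the `ne(0).idxmin()` trick, which fails because `idxmin` on a boolean frame returns the label of the first `False` (i.e. the first zero), not the smallest nonzero value.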

Performance

def chris():
    df1[df1.gt(0)].min(0)

def chris2():
    df1.where(df1.gt(0)).min(0)

def wen():
    a=df1.values.T
    a = np.ma.masked_equal(a, 0.0, copy=False)
    a.min(1)

def haleemur():
    df1.replace(0, np.nan).min()

Setup

from timeit import timeit
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['chris', 'chris2', 'wen', 'haleemur'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        df1 = df.copy()
        df1 = pd.concat([df1]*c)
        stmt = '{}()'.format(f)
        setp = 'from __main__ import df1, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

Results

[benchmark plot: time (relative) vs N, log-log scale]

user3483203
5

Maybe numpy is a good alternative:

a=df.values.T
a = np.ma.masked_equal(a, 0.0, copy=False)
a.min(1)
Out[755]: 
masked_array(data=[1, 2, 3],
             mask=[False, False, False],
       fill_value=999999,
            dtype=int64)
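If a labeled result is wanted, the masked output can be wrapped back into a Series; a sketch (the `astype(float)` is there so an all-zero column, which stays masked, can be filled with NaN even when the data is integer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 10, 0],
                   [4, 0, 0],
                   [1, 2, 3]],
                  columns=['first', 'second', 'third'])

# Mask zeros, take the per-column min, and restore the column labels
a = np.ma.masked_equal(df.values.T, 0.0, copy=False)
out = pd.Series(a.min(1).astype(float).filled(np.nan), index=df.columns)
```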
BENY
4

You can loop through all columns and find the min of each series without 0s.

df = pd.DataFrame([[0, 0, 0],
                   [0, 10, 0],
                   [4, 0, 0],
                   [1, 2, 3]],
                  columns=['first', 'second', 'third'])

[df[col][df[col].ne(0)].min() for col in df.columns]

Output:

[1, 2, 3]
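A small variant of the same loop that keeps the column labels, collecting into a Series via a dict comprehension (a sketch using the same setup):

```python
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 10, 0],
                   [4, 0, 0],
                   [1, 2, 3]],
                  columns=['first', 'second', 'third'])

# Per column: drop the zeros, then take the min, keyed by column name
result = pd.Series({col: df[col][df[col].ne(0)].min() for col in df.columns})
```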
harvpan
4

Another option would be to replace 0 with np.nan and then apply the min method.

note: this doesn't address the > 0 condition, but the test frame seems to contain only non-negative values.

using the same setup as others:

df = pd.DataFrame([[0, 0, 0],
                   [0, 10, 0],
                   [4, 0, 0],
                   [1, 2, 3]],
                  columns=['first', 'second', 'third'])

df.replace(0, np.nan).min()

first     1.0
second    2.0
third     3.0
dtype: float64

Posting this alternative as I found it to be slightly faster than the excellent answer by user3483203, which was also my first instinct for this problem.


%timeit df.replace(0, np.nan).min()
745 µs ± 2.72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit df[df > 0].min()
1.09 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

note also that:

%timeit df[df != 0].min()
1.1 ms ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

So, if one needs to ignore a specific value when calculating the aggregate, rather than a range, replacing it with np.nan seems to be the performant way to go.

Haleemur Ali
  • Can't prove it right now, but not sure if this is still faster for big df's. Would guess `df.gt(0)` gets faster – rafaelc Jul 23 '18 at 15:23
  • yup, entirely possible, i'll run the benchmarks and post the chart shortly, as the performance profile could be useful to people in general – Haleemur Ali Jul 23 '18 at 15:24
  • Could you please adding the timing for my method , just curious about it ..thank you also , vote for your additional perfect work – BENY Jul 23 '18 at 15:31
  • @Wen, absolutely will add timing for your solution:) creating the test-setup right now. – Haleemur Ali Jul 23 '18 at 15:40
  • 2
    @Wen added to my answer, yours wins at every dataframe size. – user3483203 Jul 23 '18 at 15:42
0

Try for each column:

    df.value.min(skipna=True)
Denis
  • 1
    skipna=True is the default, and this doesn't address the > 0 condition, because zero and negative values aren't null. – DSM Jul 23 '18 at 15:17