How to select rows in a DataFrame between two values, in Python Pandas?

Question

I am trying to modify a DataFrame df to only contain rows for which the values in the column closing_price are between 99 and 101 and trying to do this with the code below.

However, I get the error

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

and I am wondering if there is a way to do this without using loops.

df = df[(99 <= df['closing_price'] <= 101)]

The issue here is that you can't compare a scalar with an array hence the error, for comparisons you have to use the bitwise operators and enclose them in parentheses due to operator precedence — EdChum, Jul 24 '15 at 20:21
`df.query` and `pd.eval` seem like good fits for this use case. For information on the `pd.eval()` family of functions, their features and use cases, please visit [Dynamic Expression Evaluation in pandas using pd.eval()](https://stackoverflow.com/questions/53779986/dynamic-expression-evaluation-in-pandas-using-pd-eval). — cs95, Dec 16 '18 at 04:57
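
A minimal sketch of where the error comes from, assuming a small made-up `closing_price` column: Python expands the chained comparison `99 <= s <= 101` into `(99 <= s) and (s <= 101)`, and `and` needs a single True/False value from the first Series, which pandas refuses to guess.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [98, 99, 100, 101, 102]})
s = df["closing_price"]

# The chained form is evaluated as `(99 <= s) and (s <= 101)`;
# `and` calls bool() on the first Series, which raises ValueError.
try:
    df[(99 <= s <= 101)]
    raised = False
except ValueError:
    raised = True

# Each half on its own is a valid element-wise comparison.
lower_ok = 99 <= s
upper_ok = s <= 101
```

Each half returns a boolean Series, so the remaining task is only to combine the two halves element-wise.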

score 334 · Answer 1 · edited Aug 15 '19 at 09:23

Consider also series between:

df = df[df['closing_price'].between(99, 101)]
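
As a small illustration on made-up data, `between` can be used with its default bounds (both ends included) and inverted with `~`. The spelling of the `inclusive` argument differs between pandas versions, so this sketch relies on the default only.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [98, 99, 100, 101, 102]})

# between() includes both endpoints by default.
inside = df[df["closing_price"].between(99, 101)]

# Prefixing the mask with ~ inverts it ("not between").
outside = df[~df["closing_price"].between(99, 101)]
```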

edited Aug 15 '19 at 09:23

iacob

answered Nov 05 '16 at 20:18

Parfait

Is there "not between" functionality in pandas? I am not finding it. – dsugasa Apr 23 '19 at 10:16
@dsugasa, use the [tilde operator](https://stackoverflow.com/q/46054318/1422451) with `between`. – Parfait Apr 23 '19 at 12:32
@dsugasa e.g. `df = df[~df['closing_price'].between(99, 101)]` – Jan33 Dec 03 '19 at 08:46
Is there a possibility where we could use `.between()` within `.query()` ?? I am curious to know that. – Manoj Kumar Mar 26 '21 at 21:06

score 167 · Accepted Answer · answered Jul 24 '15 at 19:04

You should use () to group your boolean vector to remove ambiguity.

df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
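
To see why the parentheses matter, here is a hedged sketch on made-up integer data. `&` binds more tightly than `>=` and `<=`, so leaving the parentheses out turns the expression back into a chained comparison.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [98, 99, 100, 101, 102]})
price = df["closing_price"]

# With parentheses: two boolean Series combined element-wise by &.
mask = (price >= 99) & (price <= 101)
filtered = df[mask]

# Without parentheses, & binds tighter than >= and <=, so Python reads
# `price >= (99 & price) <= 101`, a chained comparison that fails again.
try:
    price >= 99 & price <= 101
    raised = False
except ValueError:
    raised = True
```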

answered Jul 24 '15 at 19:04

Jianxun Li

MaxU - stand with Ukraine · Answer 3 · 2017-08-21T18:36:45.060

35

A nicer alternative is the query() method:

In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})

In [59]: df
Out[59]:
   closing_price
0            104
1             99
2             98
3             95
4            103
5            101
6            101
7             99
8             95
9             96

In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
   closing_price
1             99
5            101
6            101
7             99

UPDATE: answering the comment:

I like the syntax here but fell down when trying to combine with expression; df.query('(mean + 2 *sd) <= closing_price <=(mean + 2 *sd)')

In [161]: qry = "(closing_price.mean() - 2*closing_price.std())" +\
     ...:       " <= closing_price <= " + \
     ...:       "(closing_price.mean() + 2*closing_price.std())"
     ...:

In [162]: df.query(qry)
Out[162]:
   closing_price
0             97
1            101
2             97
3             95
4            100
5             99
6            100
7            101
8             99
9             95

edited Aug 21 '17 at 18:36

answered Aug 11 '16 at 07:07

MaxU - stand with Ukraine

I like the syntax here but fell down when trying to combine with expression; df.query('(mean + 2 *sd) <= closing_price <=(mean + 2 *sd)') – mapping dom Aug 21 '17 at 11:42
@mappingdom, what is `mean` and `sd`? Are those column names? – MaxU - stand with Ukraine Aug 21 '17 at 12:38
no they are the calculated mean and standard deviation stored as a float – mapping dom Aug 21 '17 at 15:13
@mappingdom, what do you mean by "stored"? – MaxU - stand with Ukraine Aug 21 '17 at 16:06
@mappingdom, i've updated my post - is that what you were asking for? – MaxU - stand with Ukraine Aug 21 '17 at 18:32
This way doesn't work if we have a name like 'A name full of spaces spaces'. – Mai Hai Feb 19 '21 at 07:29
@MaiHai, it does if you use it [correctly](https://github.com/pandas-dev/pandas/issues/6508) – MaxU - stand with Ukraine Feb 19 '21 at 07:47
Is there a possibility where we could use `.between()` within `.query()` ?? I really would love to know that. – Manoj Kumar Mar 26 '21 at 21:05
@ManojKumar, give it a try) – MaxU - stand with Ukraine Mar 26 '21 at 21:15
@MaxU - it threw an error when I tried `df.query('closing_price.between(99, 101, inclusive=True)')` – Manoj Kumar Mar 26 '21 at 21:19
@ManojKumar, `df.query('closing_price.between(99, 101, inclusive=True)', engine="python")` - but this will be slower compared to "numexpr" engine. – MaxU - stand with Ukraine Mar 26 '21 at 23:00
@MaxU - Aaawww! I forgot engine... thanks for this :) – Manoj Kumar Mar 27 '21 at 12:49
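
The two points raised in these comments can be sketched as follows, on made-up data. Backticks are the way `query()` refers to a column name containing spaces, and `engine="python"` is the setting suggested above for calling a Series method inside the expression. The `inclusive` argument is left at its default here because its accepted values vary between pandas versions.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [98, 99, 100, 101, 102]})

# Calling a Series method inside the query string, using the Python engine.
via_between = df.query("closing_price.between(99, 101)", engine="python")

# A copy whose column name contains a space; backticks quote it in query().
spaced = df.rename(columns={"closing_price": "closing price"})
chained = spaced.query("99 <= `closing price` <= 101")
```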

score 11 · Answer 4 · answered Aug 22 '17 at 16:40

newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')

or

mean = df['closing_price'].mean()
std = df['closing_price'].std()

newdf = df.query('@mean <= closing_price <= @std')
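
For the band of two standard deviations around the mean asked about in the earlier comment, a sketch along the same lines could precompute the bounds and reference them with `@`. The data and the names `lower` and `upper` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with one obvious outlier.
df = pd.DataFrame({"closing_price": [95, 97, 99, 100, 101, 103, 130]})

# Precompute the bounds as plain Python floats.
mean = df["closing_price"].mean()
sd = df["closing_price"].std()
lower = mean - 2 * sd
upper = mean + 2 * sd

# The @ prefix tells query() to look the name up in the calling scope.
newdf = df.query("@lower <= closing_price <= @upper")
```

Only the value 130 falls outside the band in this sample, so `newdf` keeps the other six rows.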

answered Aug 22 '17 at 16:40

crashMOGWAI

I wonder if we can use `.between()` within `.query()` ?? – Manoj Kumar Mar 26 '21 at 21:03

normanius · Answer 5 · 2020-11-25T02:19:46.073

If one has to call pd.Series.between(l,r) repeatedly (for different bounds l and r), a lot of work is repeated unnecessarily. In this case, it's beneficial to sort the frame/series once and then use pd.Series.searchsorted(). I measured a speedup of up to 25x, see below.

def between_indices(x, lower, upper, inclusive=True):
    """
    Returns indices i, j such that x.iloc[i:j] holds exactly the values with
    lower <= x[k] <= upper (strict inequalities if inclusive is False),
    under the assumption that x is sorted.
    """
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

# Sort x once before repeated calls of between()
x = x.sort_values().reset_index(drop=True)
# x = x.sort_values(ignore_index=True) # for pandas>=1.0
ret1 = between_indices(x, lower=0.1, upper=0.9)
ret2 = between_indices(x, lower=0.2, upper=0.8)
ret3 = ...
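
A self-contained sketch of the same idea, using seeded random data as a stand-in for a real Series: sort once, locate the two boundary positions with `searchsorted`, and slice. The result is compared against the ordinary boolean mask.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 values from a seeded generator, sorted once.
rng = np.random.default_rng(0)
x = pd.Series(rng.standard_normal(1000)).sort_values().reset_index(drop=True)

lower, upper = -0.5, 0.5

# Positions of the first value >= lower and the first value > upper.
i = x.searchsorted(lower, side="left")
j = x.searchsorted(upper, side="right")
fast = x.iloc[i:j]

# Reference result from the ordinary boolean mask.
reference = x[(x >= lower) & (x <= upper)]
```

Because the Series is sorted, the matching values form one contiguous slice, which is why two binary searches are enough.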

Benchmark

Measure repeated evaluations (n_reps=100) of pd.Series.between() and of the method based on pd.Series.searchsorted(), for different arguments lower and upper. On my MacBook Pro 2015 with Python v3.8.0 and Pandas v1.0.3, the code below produces the following output:

# pd.Series.searchsorted()
# 5.87 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pd.Series.between(lower, upper)
# 155 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Logical expressions: (x>=lower) & (x<=upper)
# 153 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

import numpy as np
import pandas as pd

def between_indices(x, lower, upper, inclusive=True):
    # Assumption: x is sorted.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

def between_fast(x, lower, upper, inclusive=True):
    """
    Equivalent to pd.Series.between() under the assumption that x is sorted.
    """
    i, j = between_indices(x, lower, upper, inclusive)
    if True:
        return x.iloc[i:j]
    else:
        # Mask creation is slow.
        mask = np.zeros_like(x, dtype=bool)
        mask[i:j] = True
        mask = pd.Series(mask, index=x.index)
        return x[mask]

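def between(x, lower, upper, inclusive=True):
    # Boolean `inclusive` matches the pandas v1.0.3 used here; pandas >= 1.3
    # expects "both"/"neither"/"left"/"right" and 2.0 removed the boolean form.
    mask = x.between(lower, upper, inclusive=inclusive)
    return x[mask]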

def between_expr(x, lower, upper, inclusive=True):
    if inclusive:
        mask = (x>=lower) & (x<=upper)
    else:
        mask = (x>lower) & (x<upper)
    return x[mask]

def benchmark(func, x, lowers, uppers):
    for l,u in zip(lowers, uppers):
        func(x,lower=l,upper=u)

n_samples = 1000
n_reps = 100
x = pd.Series(np.random.randn(n_samples))
# Sort the Series.
# For pandas>=1.0:
# x = x.sort_values(ignore_index=True)
x = x.sort_values().reset_index(drop=True)

# Assert equivalence of different methods.
assert(between_fast(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_expr(x, 0, 1, True ).equals(between(x, 0, 1, True)))
assert(between_fast(x, 0, 1, False).equals(between(x, 0, 1, False)))
assert(between_expr(x, 0, 1, False).equals(between(x, 0, 1, False)))

# Benchmark repeated evaluations of between().
uppers = np.linspace(0, 3, n_reps)
lowers = -uppers
%timeit benchmark(between_fast, x, lowers, uppers)
%timeit benchmark(between, x, lowers, uppers)
%timeit benchmark(between_expr, x, lowers, uppers)

score 5 · Answer 6 · answered Feb 06 '19 at 01:06

If you're dealing with multiple values and multiple inputs, you could also set up an apply function like this. This example filters a DataFrame for GPS locations that fall within certain ranges.

def filter_values(lat,lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False


df = df[df.apply(lambda x: filter_values(x['lat'],x['lon']),axis=1)]
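
As an alternative to the row-wise `apply` (not this answer's method), the same filter can probably be written with column-wise masks, which avoids calling a Python function for every row. This is a sketch on made-up coordinates, and the helper `near` is a hypothetical name.

```python
import pandas as pd

# Hypothetical coordinates, for illustration only.
df = pd.DataFrame({
    "lat": [33.771, 37.795, 40.000, 33.900],
    "lon": [-118.161, -122.392, -100.000, -118.160],
})

def near(frame, lat, lon, tol=0.01):
    """Boolean mask of rows within `tol` degrees of (lat, lon) on both axes."""
    return (frame["lat"] - lat).abs().lt(tol) & (frame["lon"] - lon).abs().lt(tol)

# Combine the two target locations with | (element-wise OR).
mask = near(df, 33.77, -118.16) | near(df, 37.79, -122.39)
filtered = df[mask]
```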

score 4 · Answer 7 · answered Dec 05 '18 at 14:33

Instead of this

df = df[(99 <= df['closing_price'] <= 101)]

You should use this

df = df[(df['closing_price']>=99 ) & (df['closing_price']<=101)]

We have to use the bitwise logical operators |, &, ~, ^ (as in NumPy) to combine conditions. The parentheses are also important because of operator precedence.

For more information, see "Comparisons, Masks, and Boolean Logic".
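
A short sketch of the other operators on made-up data, assuming no missing values in the column: rows outside the range can be selected either by negating the mask with `~` or by combining the opposite conditions with `|`.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({"closing_price": [98, 99, 100, 101, 102]})
price = df["closing_price"]

inside = (price >= 99) & (price <= 101)

# Two ways to select rows outside the range: negate the mask with ~,
# or combine the opposite conditions with | (element-wise OR).
outside_not = df[~inside]
outside_or = df[(price < 99) | (price > 101)]
```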
