Filter rows based on multiple columns entries

Question

I have a dataframe which contains millions of entries and looks something like this:

Chr	Start	Alt
1	21651521	A
1	41681521	T
1	41681521	T
...	...	...
X	423565	T

I am currently trying to count the number of rows that match several conditions at the same time, i.e. Chr==1, Start==41681521 and Alt==T. Right now I am using this syntax, which works fine, but seems unpythonic and is also rather slow I think.

num_occurrence = sum((df["Chr"] == chrom) & 
                      (df["Start"] == int(position)) & 
                      (df["Alt"] == allele))

Does anyone have an approach which is more suitable then mine? Any help is much appreciated!

Cheers!

score 1 · Answer 1 · answered Oct 12 '21 at 15:44

1

Use DataFrame.all + Series.sum:

res = (df[["Chr", "Start", "Alt"]] == [chrom, int(position), allele]).all(1).sum()

For example:

import pandas as pd

# toy data
df = pd.DataFrame(data=[[1, 21651521, "A"], [1, 41681521, "T"], [1, 41681521, "T"]], columns=["Chr", "Start", "Alt"])
chrom, position, allele = 1, "21651521", "A"


res = (df[["Chr", "Start", "Alt"]] == [chrom, int(position), allele]).all(1).sum()
print(res)

Output

answered Oct 12 '21 at 15:44

Dani Mesejo

61,499
6
49
76

This is indeed more readable, but for my data it is actually slower than my proposed option. Do you know why? The idea seems to be the same. – nhaus Oct 12 '21 at 17:38
@nhaus What about `((df["Chr"] == chrom) & (df["Start"] == int(position)) & (df["Alt"] == allele)).sum()`? – Dani Mesejo Oct 12 '21 at 17:40
Probably is slower because comparing multiple columns at the same time and them checking all are True can somehow be slower – Dani Mesejo Oct 12 '21 at 17:42

normanius · Accepted Answer · 2021-10-13T09:43:38.453

Alternative 1: `pd.DataFrame.query()`

You could work with query (see also the illustrative examples here):

expr = "Chr=={chr} & Start=={pos} & Alt=='{alt}'"
ret = df.query(expr.format(chr=chrom, pos=int(position), alt=allele))

In my experiments, this led already to a considerable speedup.

Optimizing this further requires additional information about the data types involved. There are several things you could try:

Alternative 2: Query sorted data

If you can afford to sort your DataFrame prior to querying, you can use pd.Series.searchsorted(). Here is a possible approach:

def query_sorted(df, chrom, position, allele):
    """
    Returns index of the matches.
    """
    assert df["Start"].is_monotonic_increasing
    i_min, i_max = df["Start"].searchsorted([position, position+1])
    df = df.iloc[i_min:i_max]
    return df[(df["Chr"] == chrom) & (df["Alt"] == allele)].index

# Usage: first sort df by column "Start", then query:
df = df.sort_values("Start")
ret_index = query_sorted(df, chrom, position, allele)
print(len(ret_index))

Alternative 3: Use hashes

Another idea would be to use hashes. Again, this requires some calculations up front, but it speeds up the query considerably. Here is an example based on pd.util.hash_pandas_object():

def query_hash(df, chrom, position, allele):
    """
    Returns a view on df
    """
    assert "hash" in df
    dummy = pd.DataFrame([[chrom, position, allele]])
    query_hash = pd.util.hash_pandas_object(dummy, index=False).squeeze()
    return df[df["hash"] == query_hash].index

# Usage: first compute hashes over the columns of interest, then query
df["hash"] = pd.util.hash_pandas_object(df[["Chr", "Start", "Alt"]], 
                                        index=False)
ret_index = query_hash(df, chrom, position, allele)
print(len(ret_index))

Alternative 4: Use a multi-index

Pandas also operates with hashes when accessing rows via the index. Thus, instead of calculating hashes explicitly, as in the previous alternative, one could simply set the index of the DataFrame prior to querying. (Since setting all columns as index would result in an empty DataFrame, I first create a dummy column. For a real DataFrame with additional columns this will probably not be necessary.)

df["dummy"] = None
df = df.set_index(["Chr", "Start", "Alt"])
df = df.sort_index()  # Improves performance
print(len(df.loc[(chrom, position, allele)])
# Interestingly, chaining .loc[] is about twice as fast
print(len(df.loc[chrom].loc[position].loc[allele]))

Note that using an index where one index value maps to many records is not always a good idea. Also, this approach is slower than alternative 3, indicating that Pandas does some extra work here.

There are certainly many more ways to improve this, though the alternative approaches will depend on your specific needs.

Results

I tested with n=10M samples on a MacBook Pro (Mid 2015), running Python 3.8, Pandas 1.2.4 and IPython 7.24.1. Note that the performance evaluation depends on the problem size. The relative assessment of the methods therefore will change for different problem sizes.

# original (sum(s)):  1642.0 ms ± 19.1 ms
# original (s.sum()):  639.0 ms ± 21.9 ms
# query():             175.0 ms ±  1.1 ms 
# query_sorted():       17.5 ms ± 60.4 µs
# query-hash():         10.6 ms ± 62.5 µs
# multi-index:          71.5 ms ±  0.7 ms 
# multi-index (seq.):   36.5 ms ±  0.6 ms

Implementation

This is how I constructed the data and compared the different approaches.

import numpy as np 
import pandas as pd

# Create test data
n = int(10*1e6)
df = pd.DataFrame({"Chr": np.random.randint(1,23+1,n),
                   "Start": np.random.randint(100,999, n),
                   "Alt": np.random.choice(list("ACTG"), n)})

# Query point
chrom, position, allele = 1, 142, "A"

# Create test data
n = 10000000
df = pd.DataFrame({"Chr": np.random.randint(1,23+1,n),
                   "Start": np.random.randint(100,999, n),
                   "Alt": np.random.choice(list("ACTG"), n)})

# Query point
chrom, position, allele = 1, 142, "A"

# Measure performance in IPython
print("original (sum(s)):")
%timeit sum((df["Chr"] == chrom) & \
            (df["Start"] == int(position)) & \
            (df["Alt"] == allele))

print("original (s.sum()):")
%timeit ((df["Chr"] == chrom) & \
         (df["Start"] == int(position)) & \
         (df["Alt"] == allele)).sum()

print("query():")
%timeit len(df.query(expr.format(chr=chrom, \
                                 pos=position, \
                                 alt=allele)))

print("query_sorted():")
df_sorted = df.sort_values("Start")
%timeit query_sorted(df_sorted, chrom, position, allele)

print("query-hash():")
df_hash = df.copy()
df_hash["hash"] = pd.util.hash_pandas_object(df_hash[["Chr", "Start", "Alt"]], 
                                             index=False)
%timeit query_hash(df_hash, chrom, position, allele)

print("multi-index:")
df_multi = df.copy()
df_multi["dummy"] = None
df_multi = df_multi.set_index(["Chr", "Start", "Alt"]).sort_index()
%timeit df_multi.loc[(chrom, position, allele)]
print("multi-index (seq.):")
%timeit len(df_multi.loc[chrom].loc[position].loc[allele])

This is an amazing answer! Thank you very much. It also helped me to see how I can use hashes to accomplish that! I just have one question. For option 3 (Use hashes), you checked the number of overlaps with something like this: `len(df[df["hash"] == query_hash].index)`. Is there a specific reason for that? It feels like its simpler and more readable to do this: `(df["hash"] == query_hash).sum()` — nhaus, Oct 13 '21 at 08:09
@nhaus `ret = df[mask]` returns a [view](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-view-versus-copy). Calling `len(ret)` or `len(ret.index)` is equivalent to reading a property of the view (constant time, negligible). Creating a view is highly optimized and faster than summing over (a very long) mask. For my large test dataset, the difference was 30-40%. Note that this ratio will change with the length of the dataframe, in favor of `sum()`, as creating a view comes with (constant-time) overhead, which does not pay off for smaller DataFrames. — normanius, Oct 13 '21 at 08:51
@nhaus This answer is driven by curiosity... I've added another alternative based on a multi-index. :) I just noted that many of the aspects I've addressed here have been covered already in this [SO post](https://stackoverflow.com/questions/17071871/) as well. — normanius, Oct 13 '21 at 09:48