How to filter out rows in a Pandas DataFrame that contain a specific subsequence in a list-column?

Question

I have a DataFrame like below:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})

I want to filter this DataFrame such that it only contains the row in which X is a subsequence (so the order of the elements in X are the same as in the list and they not interleaved by other elements in the list) of the list column.

for example, if X = [6, 8, 3], I want the output to look like this:

id    list
1     [2, 51, 6, 8, 3]
3     [6, 8, 3, 9, 10, 11]

I know I can check if a list is a subsequence of another list with the following function (found on How to check subsequence exists in a list?):

def x_in_y(query, base):
    l = len(query)

    for i in range(len(base) - l + 1):
        if base[i:i+l] == query:
            return True
    return False

I have two questions:

Question 1:

How to apply this to a Pandas DataFrame column like in my example?

Question 2:

Is this the most efficient way to do this? And if not, what would be? The function does not look that elegant/Pythonic, and I have to apply it to a very big DataFrame of about 200K rows.

[Note: the elements of the lists in the list column are unique, should that help to optimize things]

jezrael · Accepted Answer · 2021-05-27T08:17:49.573

Here is solution call function for column:

df = df[df.list.map(lambda x: x_in_y(X, x))]
#alternative
#df = df[df.list.apply(lambda x: x_in_y(X, x))]
print (df)
   id                  list
0   1      [2, 51, 6, 8, 3]
2   3  [6, 8, 3, 9, 10, 11]

Performance is really good in sample data, in real the best test too:

#200k rows
df = pd.concat([df] * 40000, ignore_index=True)
print (df)

X = [6, 8, 3]
x = to_string([6, 8, 3])


In [166]: %timeit df.list.map(lambda x: x_in_y(X, x))
214 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [167]: %timeit df['list'].map(to_string).str.contains(x)
413 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [168]: %timeit df["list"].apply(has_subsequence, subseq=X)
5.2 s ± 420 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [169]: %timeit df.list.apply(lambda y: ''.join(map(str,X)) in ''.join(map(str,y)))
573 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I need the subset to be in the same order, thats why I said "subsequence" instead of simply "subset" — Peter, May 26 '21 at 12:53
@Peter - You are right, answer was edited. Also added timings, seems this solution is fastest. — jezrael, May 27 '21 at 08:18

Alexander Volkovsky · Answer 2 · 2021-05-26T13:44:15.600

Using the numpy rolling-window technique:

import numpy as np

def rolling_window(a, size):
    a = np.array(a)
    shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def has_subsequence(a, subseq):
    return (rolling_window(a, len(subseq)) == subseq).all(axis=1).any()

mask = df["list"].apply(has_subsequence, subseq=[6,8,3])
df[mask]

Explanation:

rolling_window creates a view into the array with the given shape and strides:

>>> rolling_window([1,2,3,4], 2)
np.array([[1,2], [2,3], [3,4]])

Then we compare the result with our target X

>>> np.array([[1,2], [2,3], [3,4]]) == [2,3]
np.array([[False, False], [True, True], [False, False]])

Then we tell to numpy to return True in cases when the all items are True over the 1st axis.

>>> np.array([[False, False], [True, True], [False, False]]).all(axis=1)
np.array([False, True, False])

And finally return True if there are any True in the array.

>>> np.array([False, True, False]).any()

Andreas · Answer 3 · 2021-05-26T13:53:48.093

You can try this:

import pandas as pd

def to_string(l):
    return '-' + '-'.join(map(str, l)) + '-'

X = to_string([6, 8, 3])
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})

df[df['list'].map(to_string).str.contains(X)]

#    id                  list
# 0   1      [2, 51, 6, 8, 3]
# 2   3  [6, 8, 3, 9, 10, 11]

In my oppinion it is important that you add delimiters to start and end of the string. Otherwise you will have problems with lists such as: [666, 8, 3]

score 1 · Answer 4 · answered May 26 '21 at 13:04

1

You can try this:

x = [6, 8, 3]
df = df.loc[df.list.apply(lambda y: ''.join(map(str,x)) in ''.join(map(str,y)))]

OUTPUT:

   id                  list
0   1      [2, 51, 6, 8, 3]
2   3  [6, 8, 3, 9, 10, 11]

answered May 26 '21 at 13:04

Nk03

14,699
2
8
22

1

Ya, thinkg about this, but not sure if better performance – jezrael May 26 '21 at 13:04
It was the 1st thing that came to my mind. Not so sure about the performance. @jezrael – Nk03 May 26 '21 at 13:06

How to filter out rows in a Pandas DataFrame that contain a specific subsequence in a list-column?

4 Answers4