2

I have a DataFrame like below:

df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                   "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})

I want to filter this DataFrame such that it only contains the row in which X is a subsequence (so the order of the elements in X are the same as in the list and they not interleaved by other elements in the list) of the list column.

for example, if X = [6, 8, 3], I want the output to look like this:

id    list
1     [2, 51, 6, 8, 3]
3     [6, 8, 3, 9, 10, 11]

I know I can check if a list is a subsequence of another list with the following function (found on How to check subsequence exists in a list?):

def x_in_y(query, base):
    l = len(query)

    for i in range(len(base) - l + 1):
        if base[i:i+l] == query:
            return True
    return False

I have two questions:

Question 1:

How to apply this to a Pandas DataFrame column like in my example?

Question 2:

Is this the most efficient way to do this? And if not, what would be? The function does not look that elegant/Pythonic, and I have to apply it to a very big DataFrame of about 200K rows.

[Note: the elements of the lists in the list column are unique, should that help to optimize things]

Peter
  • 722
  • 6
  • 24

4 Answers4

2

Here is solution call function for column:

df = df[df.list.map(lambda x: x_in_y(X, x))]
#alternative
#df = df[df.list.apply(lambda x: x_in_y(X, x))]
print (df)
   id                  list
0   1      [2, 51, 6, 8, 3]
2   3  [6, 8, 3, 9, 10, 11]

Performance is really good in sample data, in real the best test too:

#200k rows
df = pd.concat([df] * 40000, ignore_index=True)
print (df)

X = [6, 8, 3]
x = to_string([6, 8, 3])


In [166]: %timeit df.list.map(lambda x: x_in_y(X, x))
214 ms ± 6.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [167]: %timeit df['list'].map(to_string).str.contains(x)
413 ms ± 4.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [168]: %timeit df["list"].apply(has_subsequence, subseq=X)
5.2 s ± 420 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [169]: %timeit df.list.apply(lambda y: ''.join(map(str,X)) in ''.join(map(str,y)))
573 ms ± 116 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 3
    I need the subset to be in the same order, thats why I said "subsequence" instead of simply "subset" – Peter May 26 '21 at 12:53
  • 1
    @Peter - You are right, answer was edited. Also added timings, seems this solution is fastest. – jezrael May 27 '21 at 08:18
2

Using the numpy rolling-window technique:

import numpy as np

def rolling_window(a, size):
    a = np.array(a)
    shape = a.shape[:-1] + (a.shape[-1] - size + 1, size)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def has_subsequence(a, subseq):
    return (rolling_window(a, len(subseq)) == subseq).all(axis=1).any()

mask = df["list"].apply(has_subsequence, subseq=[6,8,3])
df[mask]

Explanation:

rolling_window creates a view into the array with the given shape and strides:

>>> rolling_window([1,2,3,4], 2)
np.array([[1,2], [2,3], [3,4]])

Then we compare the result with our target X

>>> np.array([[1,2], [2,3], [3,4]]) == [2,3]
np.array([[False, False], [True, True], [False, False]])

Then we tell to numpy to return True in cases when the all items are True over the 1st axis.

>>> np.array([[False, False], [True, True], [False, False]]).all(axis=1)
np.array([False, True, False])

And finally return True if there are any True in the array.

>>> np.array([False, True, False]).any()
Alexander Volkovsky
  • 2,588
  • 7
  • 13
2

You can try this:

import pandas as pd

def to_string(l):
    return '-' + '-'.join(map(str, l)) + '-'

X = to_string([6, 8, 3])
df = pd.DataFrame({"id": [1, 2, 3, 4, 5], "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})

df[df['list'].map(to_string).str.contains(X)]

#    id                  list
# 0   1      [2, 51, 6, 8, 3]
# 2   3  [6, 8, 3, 9, 10, 11]

In my oppinion it is important that you add delimiters to start and end of the string. Otherwise you will have problems with lists such as: [666, 8, 3]

Andreas
  • 8,694
  • 3
  • 14
  • 38
1

You can try this:

x = [6, 8, 3]
df = df.loc[df.list.apply(lambda y: ''.join(map(str,x)) in ''.join(map(str,y)))]

OUTPUT:

   id                  list
0   1      [2, 51, 6, 8, 3]
2   3  [6, 8, 3, 9, 10, 11]
Nk03
  • 14,699
  • 2
  • 8
  • 22