I have a DataFrame like below:

    import pandas as pd

    df = pd.DataFrame({"id": [1, 2, 3, 4, 5],
                       "list": [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]})

I want to filter this DataFrame so that it only contains the rows in which X is a subsequence of the list in the list column. By subsequence I mean that the elements of X appear in the list in the same order and are not interleaved with other elements of the list.
For example, if X = [6, 8, 3], I want the output to look like this:

    id  list
    1   [2, 51, 6, 8, 3]
    3   [6, 8, 3, 9, 10, 11]

I know I can check whether a list is a subsequence of another list with the following function (found in "How to check subsequence exists in a list?"):
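
    def x_in_y(query, base):
        """Return True if query occurs in base as a contiguous run."""
        l = len(query)
        # Compare query against every window of length l in base
        for i in range(len(base) - l + 1):
            if base[i:i+l] == query:
                return True
        return False
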
I have two questions:
Question 1:
How to apply this to a Pandas DataFrame column like in my example?
Question 2:
Is this the most efficient way to do this? And if not, what would be? The function does not look that elegant/Pythonic, and I have to apply it to a very big DataFrame of about 200K rows.
[Note: the elements of each list in the list column are unique, in case that helps to optimize things.]
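
To illustrate how the uniqueness might help, here is a rough sketch on plain lists. The name `x_in_y_unique` is my own, and it assumes every list really has unique elements. Because the first element of X can occur at most once, there is only one possible starting position, so a single slice comparison is enough. I have not benchmarked it.

```python
def x_in_y_unique(query, base):
    """Contiguous-run check that ASSUMES the elements of base are unique."""
    if not query:
        return True  # an empty query trivially matches, like the original function
    try:
        # With unique elements there is at most one candidate start position
        start = base.index(query[0])
    except ValueError:
        return False  # the first element of query is not in base at all
    return base[start:start + len(query)] == query


# The lists from the example DataFrame, with X = [6, 8, 3]
lists = [[2, 51, 6, 8, 3], [19, 2, 11, 9], [6, 8, 3, 9, 10, 11], [4, 5], [8, 3, 9, 6]]
matches = [x_in_y_unique([6, 8, 3], lst) for lst in lists]
print(matches)  # [True, False, True, False, False]
```

The True entries correspond to ids 1 and 3, which is the output I am after. If a list contained duplicates, this shortcut could miss a match that starts at a later occurrence of the first element.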