How can I test for an ordered subset?

Question

Firstly
I need to be able to test that 'abc' is an ordered subset of 'axbyc' and that 'egd' is not an ordered subset of 'edg'. Put another way, one string is an ordered subset of another if I can remove some characters from the second string and be left with the first.

Secondly
I need to compare one pd.Series with another pd.Series to determine whether each element of one is an ordered subset of the corresponding element of the other.

Consider the pd.Series s1 and s2:

s1 = pd.Series(['abc', 'egd'])
s2 = pd.Series(['axbyc', 'edg'])

I need to compare them so that the answer to the question
"Are the elements of s1 ordered subsets of the corresponding elements of s2?" is

0     True
1    False
dtype: bool

What you call an "ordered subset" is usually called a [subsequence](https://en.wikipedia.org/wiki/Subsequence) in math. See [this rather elegant solution](http://stackoverflow.com/a/24017747) to the question [How to test if one string is a subsequence of another?](http://stackoverflow.com/q/24017363) — , Oct 19 '16 at 04:39
@friendlydog thank you, I knew there had to be a better name for it. I'm looking for better solutions than that. I have one already. I'm curious if others have better solutions. I'll post my answer to demonstrate. — piRSquared, Oct 19 '16 at 04:47

score 2 · Accepted Answer · edited Oct 19 '16 at 04:10

For the first part of the question:

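def ordered_subset(s1, s2):
    """Return True if s1 is a subsequence of s2 (order preserved)."""
    s2 = iter(s2)  # one shared iterator, so the scan never moves backwards
    try:
        for c in s1:
            # Consume s2 until the next occurrence of c.
            while next(s2) != c:
                pass
    except StopIteration:
        # s2 ran out before every character of s1 was found.
        return False
    return True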
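The same single-pass idea can probably be written more compactly by relying on the fact that `c in it` advances an iterator until it finds `c` or runs out. This is only a sketch; the name `is_subsequence` is an assumption and does not come from the answer above.

```python
def is_subsequence(sub, full):
    """Hypothetical helper: True if `sub` can be read off `full` left to right."""
    it = iter(full)
    # `c in it` consumes the iterator up to and including the first match,
    # so each later character is searched only in the remaining tail.
    return all(c in it for c in sub)


print(is_subsequence('abc', 'axbyc'))  # True
print(is_subsequence('egd', 'edg'))    # False
```

An empty first string counts as a subsequence of anything here, since `all` of an empty generator is `True`.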

For the second part of the question:

pd.concat([s1, s2], axis=1).apply(lambda x: ordered_subset(*x), axis=1)

0     True
1    False
dtype: bool
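If the row-wise `apply` turns out to be slow, one alternative worth trying is a plain list comprehension over the paired elements. The sketch below uses ordinary lists in place of the two Series so that it runs without pandas; `is_subsequence` and `ordered_subset_pairs` are hypothetical names, and pairing is by position, not by index label.

```python
def is_subsequence(sub, full):
    """Hypothetical helper: single-pass subsequence test."""
    it = iter(full)
    return all(c in it for c in sub)


def ordered_subset_pairs(left, right):
    """Hypothetical helper: test left[i] against right[i] for each position."""
    # zip stops at the shorter input, so both are assumed to have equal length.
    return [is_subsequence(a, b) for a, b in zip(left, right)]


left = ['abc', 'egd']
right = ['axbyc', 'edg']
print(ordered_subset_pairs(left, right))  # [True, False]
```

With real Series, the resulting list could be wrapped as `pd.Series(ordered_subset_pairs(s1, s2), index=s1.index)`, assuming s1 and s2 are already aligned.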

edited Oct 19 '16 at 04:10

piRSquared

answered Oct 19 '16 at 04:04

Francisco

score 0 · Answer 2 · answered Oct 19 '16 at 04:51

Use '.*'.join to build a regex pattern from each element of s1, then match it against the corresponding element of s2.

import re
import pandas as pd

s1 = pd.Series(['abc', 'egd'])
s2 = pd.Series(['axbyc', 'edg'])

match = lambda x: bool(re.match(*x))
pd.concat([s1.str.join('.*'), s2], axis=1).T.apply(match)

0     True
1    False
dtype: bool

Notice that

s1.str.join('.*')

0    a.*b.*c
1    e.*g.*d
Name: x, dtype: object

answered Oct 19 '16 at 04:51

piRSquared

Interesting solution! For a general subsequence test, I have two suggestions: 1) use `re.search` instead of `re.match`--or prepend `'.*'`--to the joined string in order to allow the two strings to start with different characters (`"abc"` is a subsequence of `"zaxbyc"`); 2) if your strings can contain more than just letters, `map` `re.escape` over the first string to avoid clashing with regex meta-characters (so for example, you can test if `"ab*"` is a subsequence of `"zaxby*"`). – Oct 19 '16 at 05:18
I wouldn't recommend using a regular expression like that with Python's backtracking implementation, consider: `re.match('.*'.join("a" * 30), "a" * 30)` – Francisco Oct 19 '16 at 05:38
@FranciscoCouzo I'm almost certainly going to abandon this approach for something else. I'm trying to vectorize it. – piRSquared Oct 19 '16 at 05:38
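The two suggestions in the comments above (use `re.search`, and escape each character) might be combined roughly as follows. This is a sketch under assumptions: the name `is_subsequence_regex` is hypothetical, and the `re.DOTALL` flag is my addition so that `.` can also skip newlines.

```python
import re


def is_subsequence_regex(sub, full):
    """Hypothetical helper: regex-based subsequence test."""
    # Escape every character so that '*', '.', etc. are matched literally.
    pattern = '.*'.join(map(re.escape, sub))
    # re.search lets the match start anywhere in `full`.
    return re.search(pattern, full, flags=re.DOTALL) is not None


print(is_subsequence_regex('abc', 'zaxbyc'))  # True
print(is_subsequence_regex('ab*', 'zaxby*'))  # True
print(is_subsequence_regex('egd', 'edg'))     # False
```

The backtracking concern raised above still applies: patterns built this way can become very slow on long, repetitive inputs, so an iterator-based test is likely the safer choice.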
