Python's str.split() method accepts a literal string as its separator.

For example,

inpoot = "fuzzy#wuzzy#was#a#bear"
outpoot = inpoot.split("#")
print(outpoot)
# ["fuzzy", "wuzzy", "was", "a", "bear"]

I want something like str.split() except that it will accept an arbitrary regular expression as input.
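For comparison, here is a minimal sketch of the kind of call I have in mind, using the standard library's `re.split`. The sample string and the pattern are made up for illustration.

```python
import re

inpoot = "fuzzy#wuzzy;was , a#bear"

# Split on "#", ";" or ",", each optionally surrounded by whitespace.
outpoot = re.split(r"\s*[#;,]\s*", inpoot)
print(outpoot)
# ['fuzzy', 'wuzzy', 'was', 'a', 'bear']
```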

As a simpler first step than what I want, we could split on a predicate.
The predicate would be a function which:

  • accepts a single character as input
  • will output a Boolean True or False

If the predicate outputs True when evaluated on input ch, then we consider ch to be a delimiter.

  • ch will be omitted from the output.
  • the input string will be split at whatever index ch was located at.
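
def pred_split(stryng: str, is_delim):
    # PRECONDITION:
    #
    #     callable(is_delim) == True
    #
    buffer = list()
    out = list()
    for ch in stryng:
        if is_delim(ch):
            # ch is a delimiter: emit the piece collected so far and drop ch.
            out.append("".join(buffer))
            buffer = list()
        else:
            buffer.append(ch)
    # Emit whatever follows the last delimiter.
    out.append("".join(buffer))
    return out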

However, the code shown above only works for single-character delimiters.

Suppose that "::" is a delimiter, but a single colon ":" is not. In that case, inpoot.split("::") would work. However, my point is that delimiters are often more complicated than single characters.
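A small illustration of the difference, using a made-up sample string:

```python
text = "a::b:c::d"

# Splitting on the two-character delimiter leaves the lone ":" intact.
double = text.split("::")
print(double)   # ['a', 'b:c', 'd']

# Splitting on the single character cannot tell "::" apart from ":".
single = text.split(":")
print(single)   # ['a', '', 'b', 'c', '', 'd']
```

Any test that looks at one character at a time behaves like the second call, which is why a predicate on characters is not enough here.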

We want to split anytime we encounter a string from a set of strings.

class MultiSplit:

    @classmethod
    def leaf_iter(cls, tree):
        # Yield every string leaf of a (possibly nested) list, left to right.
        for node in tree:
            if isinstance(node, str):
                yield node
            else:
                yield from cls.leaf_iter(node)

    @classmethod
    def __call__(cls, stryng: str, delims):
        """
        SOME PRECONDITIONS:
            hasattr(delims, "__iter__")

            for s in iter(delims):
                isinstance(s, str)
        """
        tree = [stryng]
        for delim in delims:
            # Replace each leaf with the list of pieces it splits into.
            # leaf_iter flattens the nesting again on the next pass.
            tree = [leaf.split(delim) for leaf in cls.leaf_iter(tree)]

        return list(cls.leaf_iter(tree))

# Rebind the name to a single instance so it can be called like a function.
MultiSplit = MultiSplit()

inpoot = "a::-b-end_c_start-d"
delim_strs = ["::", "end", "start"]

output = MultiSplit(inpoot, delim_strs)
assert output == ["a", "-b-", "_c_", "-d"]

We still don't have quite what we want.
There are several problems, only one of which is that a list of strings is not exactly a regular expression.
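One way to bridge a list of strings and a regular expression is to build the pattern from the list. The sketch below is only an illustration: `regex_multi_split` is a hypothetical helper name, and it assumes `delims` is a non-empty collection of non-empty strings.

```python
import re

def regex_multi_split(stryng, delims):
    # Hypothetical helper: split on any of the literal strings in delims.
    # Assumes delims is non-empty and contains no empty strings.
    # Longer delimiters go first so that a delimiter which is a prefix of
    # another one does not shadow it in the alternation.
    ordered = sorted(delims, key=len, reverse=True)
    # re.escape makes each delimiter match literally.
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, stryng)

result = regex_multi_split("a::-b-end_c_start-d", ["::", "end", "start"])
print(result)
# ['a', '-b-', '_c_', '-d']
```

Unlike splitting on one delimiter after another, this scans the string once, so a piece produced by one delimiter is never split again by text that only appears after joining.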

Toothpick Anemone
  • the regex module has a function for this: `re.split`. – kwkt Jan 10 '21 at 17:25
  • Does this answer your question? [Split Strings into words with multiple word boundary delimiters](https://stackoverflow.com/questions/1059559/split-strings-into-words-with-multiple-word-boundary-delimiters) – mkrieger1 Jan 10 '21 at 17:35
