Python's str.split() method usually accepts a string as input.
For example,
inpoot = "fuzzy#wuzzy#was#a#bear"
outpoot = inpoot.split("#")
print(outpoot)
# ["fuzzy", "wuzzy", "was", "a", "bear"]
I want something like str.split(), except that it will accept an arbitrary regular expression as input.
For something simpler than what I want, we could have a predicate.
The predicate would be a function which:
- accepts a single character as input
- outputs a Boolean, True or False
If the predicate outputs True when evaluated on input ch, then we consider ch to be a delimiter:
- ch will be omitted from the output.
- the input string will be split at whatever index ch was located at.
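def pred_split(stryng: str, is_delim):
    # PRECONDITION:
    #
    #     callable(is_delim) == True
    #
    buffer = list()
    out = list()
    for ch in stryng:
        if is_delim(ch):
            # ch is a delimiter: close the current piece and drop ch
            out.append("".join(buffer))
            buffer = list()
        else:
            buffer.append(ch)
    # keep whatever follows the last delimiter
    out.append("".join(buffer))
    return out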
However, the code shown above only works for single-character delimiters.
Suppose that "::" is a delimiter, but a single colon ":" is not a delimiter. In that case, inpoot.split("::") would work, but a per-character predicate cannot express the rule. My point is that delimiters are often more complicated than single characters.
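To make the limitation concrete, here is a small self-contained sketch. The helper name split_on_chars and the sample string are made up for illustration; the helper just splits wherever a per-character predicate returns True.

```python
def split_on_chars(text, is_delim):
    # Hypothetical helper: split text at every character for which
    # is_delim(ch) is True, dropping that character.
    pieces, buffer = [], []
    for ch in text:
        if is_delim(ch):
            pieces.append("".join(buffer))
            buffer = []
        else:
            buffer.append(ch)
    pieces.append("".join(buffer))  # keep whatever follows the last delimiter
    return pieces

text = "key::12:30::done"

# A per-character predicate cannot tell "::" apart from ":".
by_char = split_on_chars(text, lambda ch: ch == ":")
print(by_char)  # ['key', '', '12', '30', '', 'done']

# str.split handles the two-character delimiter directly.
by_str = text.split("::")
print(by_str)   # ['key', '12:30', 'done']
```

The predicate version breaks "12:30" apart and produces empty strings for each "::", which is not what I want.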
We want to split anytime we encounter a string from a set of strings.
import itertools as itts

class MultiSplit:
    itts = itts
    Sentinel = type("Sentinel", tuple(), dict())
    Sentinel = Sentinel()  # singleton instance marking the end of the delimiters

    @classmethod
    def leaf_iter(cls, tree):
        # Yield every string leaf of a nested list, left to right.
        for node in tree:
            if isinstance(node, str):
                yield node
            else:
                yield from cls.leaf_iter(node)

    @classmethod
    def __call__(cls, stryng: str, delims):
        """
        SOME PRECONDITIONS:
            hasattr(delims, "__iter__")
            for s in iter(delims):
                isinstance(s, str)
        """
        delims = cls.itts.chain(iter(delims), iter([cls.Sentinel]))
        tree = [stryng]
        for delim in delims:
            if delim is not cls.Sentinel:
                # Split every current leaf; each leaf becomes a sub-list.
                tree = [leaf.split(delim) for leaf in cls.leaf_iter(tree)]
        return list(cls.leaf_iter(tree))

MultiSplit = MultiSplit()  # replace the class with a callable instance

inpoot = "a::-b-end_c_start-d"
delim_strs = ["::", "end", "start"]
outpoot = MultiSplit(inpoot, delim_strs)
assert outpoot == ["a", "-b-", "_c_", "-d"]
We still don't have quite what we want.
There are several problems, only one of which is that a list of strings is not exactly a regular expression.
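One direction I can see, sketched here on the assumption that the standard library's re module is acceptable: escape each delimiter string with re.escape, join the results into a single alternation, and hand that to re.split. The helper name split_on_any is made up for this example, and it assumes the list of delimiters is non-empty.

```python
import re

def split_on_any(text, delims):
    # Hypothetical helper; assumes delims is a non-empty iterable of strings.
    # Longer delimiters go first so that "::" would be tried before ":".
    ordered = sorted(delims, key=len, reverse=True)
    # re.escape keeps characters such as "." or "*" literal.
    pattern = "|".join(re.escape(d) for d in ordered)
    return re.split(pattern, text)

parts = split_on_any("a::-b-end_c_start-d", ["::", "end", "start"])
print(parts)  # ['a', '-b-', '_c_', '-d']

# Since the delimiters end up in a regular expression anyway,
# an arbitrary pattern can also be passed to re.split directly.
words = re.split(r"\s*[,;]\s*", "fuzzy , wuzzy;was ;a, bear")
print(words)  # ['fuzzy', 'wuzzy', 'was', 'a', 'bear']
```

Unlike splitting on one delimiter at a time, a single alternation scans the string once, so a piece produced by one delimiter is never re-split by another.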