Unexpected behavior of a function built to replace split()

Question

I wrote a function meant to improve on the built-in split() function (I know it's not idiomatic Python, but I did my best). When I call it like this:

better_split("After  the flood   ...  all the colors came out."," .")

I'd expected this outcome:

['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']

However, the function behaves in a way I can't explain. When it reaches the last two words, it no longer suppresses the ' ' and, rather than adding "came" and "out" to the result list, it adds "came out", so I got this:

['After', 'the', 'flood', 'all', 'the', 'colors', 'came out']

Does someone with more experience understand why this happens? Thank you in advance for any help!

def better_split(text,markersString):
    markers = []
    splited = []
    for e in markersString:
        markers.append(e)
    for character in text:
        if character in markers:
            point = text.find(character)
            if text[:point] not in character:
                word = text[:point]
                splited.append(word)
                while text[point] in markers and point+1 < len(text):
                    point = point + 1
                text = text[point:]
    print 'final splited = ', splited

better_split("This is a test-of the,string separation-code!", " ,!-")

better_split("After the flood ... all the colors came out."," .")

split() WITH MULTIPLE SEPARATORS

If you are looking for split() with multiple separators, see: Split Strings with Multiple Delimiters?

The best answer without import re that I found was this:

def my_split(s, seps):
    res = [s]
    for sep in seps:
        s, res = res, []
        for seq in s:
            res += seq.split(sep)
    return res

Tadeck · Answer 1 · 2012-03-17T04:21:14.170

3

Simpler solution

Your better_split function is simpler than you think. I have implemented it as such:

def better_split(s, seps):
    result = [s]
    def split_by(sep):
        return lambda s: s.split(sep)
    for sep in seps:
        result = sum(map(split_by(sep), result), [])
    return filter(None, result)  # Do not return empty elements

Tests

>>> better_split("This is a test-of the,string separation-code!", " ,!-")
['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code']
>>> better_split("After the flood ... all the colors came out."," .")
['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']
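
The code above targets Python 2, where filter returns a list. A minimal sketch of the same idea for Python 3, written with list comprehensions instead of lambda, map and filter (the name better_split_py3 is made up for this illustration and is not part of the answer):

```python
def better_split_py3(s, seps):
    """Split s on every character in seps, dropping empty pieces."""
    result = [s]
    for sep in seps:
        # Split every piece collected so far on the current separator
        # and flatten the pieces into one new list.
        result = [piece for chunk in result for piece in chunk.split(sep)]
    # Adjacent separators produce empty strings; drop them at the end.
    return [piece for piece in result if piece]


print(better_split_py3("After  the flood   ...  all the colors came out.", " ."))
# ['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']
```

Building a new list on each pass keeps the loop free of lambda, map and filter, which may suit readers who have not met them yet.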

Tips about your code

you do not need to convert markersString into markers; you can iterate directly over markersString,
text[:point] not in character is always True when point > 1, so the check is rather useless,
point = text.find(character) will give you point = -1 every time character is not found in text,
try to simplify your code. The Zen of Python says: "If the implementation is hard to explain, it's a bad idea." Unfortunately, your code is hard to read: it contains a lot of redundant statements, plus statements that look like they are supposed to work differently than they do (e.g. using str.find to get the position of the separator and then using the result for slicing without checking it).
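
To illustrate the point about str.find, here is a small sketch with a made-up string. It shows what happens when the unchecked result is used in a slice, and two ways to guard against it:

```python
text = "came out"

# str.find returns -1 when the substring is absent, and -1 is a valid
# slice bound, so the mistake passes silently.
point = text.find(".")
print(point)         # -1
print(text[:point])  # came ou

# str.index raises ValueError instead, so the problem cannot go unnoticed.
try:
    text.index(".")
except ValueError:
    print("separator not found")

# With str.find, an explicit check before slicing avoids the problem.
point = text.find(" ")
if point != -1:
    print(text[:point])  # came
```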

edited Mar 17 '12 at 04:21

answered Mar 17 '12 at 03:55


Thanks, Tadeck, for this nice code sample! I'm studying it now. Unfortunately, it uses some features (lambda, filter(), map()) which I have not learned yet and which I'm not supposed to use in this work (hence why my code is so big and confusing...). So I would really like to understand what happens with this code and this strange behavior. Once again, thank you! – craftApprentice Mar 17 '12 at 04:02
@Pythonista'sApprentice: Ok, I will give you my feedback. When it comes to lambda, filter and map... `lambda` is just an anonymous function, so it can be called and then returns the result calculated after `:`. `filter` filters an iterable by applying a specific function - if `None` is given instead of a function, then `filter` returns only the elements that evaluate as truthy (e.g. it skips empty strings). `map` just returns a list with every element processed by the function given in the argument. – Tadeck Mar 17 '12 at 04:08
even simpler: [`filter(None, re.split("|".join(map(re.escape, separators)), text))`](http://stackoverflow.com/a/9747555/4279) – jfs Mar 17 '12 at 05:45
Tadeck, thank you again for these valuable lessons. They will make me a better programmer (or apprentice). I've turned markerString into markers to check this: "text[point] in markers". – craftApprentice Mar 17 '12 at 14:26

score 3 · Answer 2 · answered Mar 17 '12 at 04:14

The problem is that this:

    for character in text:

is looping over the characters in the initial string — the original value of text — while this:

        point = text.find(character)

searches for the delimiter in the current string — the current value of text. So that part of your function operates on the assumption that you handle one delimiter-character at a time; that is, it assumes that whenever you come across a delimiter-character in your loop over original text, that this is the first delimiter-character in the current text.
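
A small sketch of that mismatch, using a made-up string: rebinding text inside the loop has no effect on which characters the for statement visits.

```python
text = "a b c"
seen = []
for character in text:
    # Rebinding the name `text` does not affect the loop: the for statement
    # took its iterator from the original string object once, up front.
    text = text[1:]
    seen.append(character)

print(seen)        # ['a', ' ', 'b', ' ', 'c']
print(repr(text))  # ''
```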

Meanwhile, this:

            while text[point] in markers and point+n < len(text):
                point = point + 1
            text = text[point:]

serves to remove multiple delimiters at once; its goal is to remove a sequence of consecutive delimiter characters. This violates the assumption of the above-mentioned code that only one delimiter is handled at a time.

So the processing goes like this:

  [After  the flood   ...  all the colors came out.]
handling first space after "After":
  [After] [the flood   ...  all the colors came out.]
handling second space after "After":
  [After] [the] [flood   ...  all the colors came out.]
handling space after "the":
  [After] [the] [flood] [all the colors came out.]
handling first space after "flood":
  [After] [the] [flood] [all] [the colors came out.]
handling second space after "flood":
  [After] [the] [flood] [all] [the] [colors came out.]
handling third space after "flood":
  [After] [the] [flood] [all] [the] [colors] [came out.]
handling first period of the "...":
  [After] [the] [flood] [all] [the] [colors] [came out] [.]
-- only "." is left in text, so no more words are split off

As you can see, the delimiter that you're handling doesn't end up being the delimiter that you split on.

The solution is simply to remove the logic that lets you skip multiple delimiters at once — that is, change this:

            while text[point] in markers and point+n < len(text):
                point = point + 1
            text = text[point:]

to this:

            text = text[(point + 1):]

and instead, right before you add word to splited, make sure it's nonempty:

            if len(word) > 0:
                splited.append(word)
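
Put together, the two changes could look like the sketch below. The name better_split_fixed is hypothetical, the function returns the list instead of printing it, and the final check for a trailing word is an addition that the answer does not mention; without it, a text that does not end in a delimiter would lose its last word.

```python
def better_split_fixed(text, markers_string):
    """The question's loop with the two changes suggested above."""
    words = []
    for character in text:
        if character in markers_string:
            # Exactly one delimiter is consumed per iteration, so the first
            # occurrence of `character` in the current text is the
            # delimiter being handled.
            point = text.find(character)
            word = text[:point]
            if len(word) > 0:  # skip empty words between adjacent delimiters
                words.append(word)
            text = text[(point + 1):]
    # Assumption beyond the answer: keep a trailing word that is not
    # followed by a delimiter.
    if text:
        words.append(text)
    return words


print(better_split_fixed("After  the flood   ...  all the colors came out.", " ."))
# ['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']
```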

Thanks, @ruakh, your nice and detailed answer made me realize my mistake. I don't know if I completely understand your suggestions, but I tried this code based on them: — craftApprentice, Mar 17 '12 at 13:56
def better_split(text,markersString): markers = [] splited = [] for e in markersString: markers.append(e) for character in text: if character in markers: point = text.find(character) word = text[:point] if len(word) > 0: splited.append(word) text = text[point:] print 'final splited = ', splited with this output: ['After', ' the flood ', '..', ' all the colors came out'] — craftApprentice, Mar 17 '12 at 13:58

Reorx · Accepted Answer · 2012-03-17T05:11:57.600

The point is that the iterator is created once, over the original string, when this line:

for character in text:

is executed,

but your aim is to iterate over the changed text after every pass of the loop.

So the solution is to move the for loop into an inner function and call it recursively:

def better_split(text,markersString):
    # simple and better way for 'for e in markerString...'
    markers = list(markersString)
    splited = []

    # there is no need to assign variable n, we all know it should be 1
    # n = 1    

    def iter_text(text):
        # check if text is an empty string.
        # NOTE: this `text` shadows the `text` of the outer function (local scope),
        # so it is the text passed to each iter_text() call,
        # not the one passed to better_split().
        if not text:
            return
        # [UPDATES 2012-03-17 01:07 EST]
        # add a flag to judge if there are markers in `text`
        _has_marker = False
        for character in text:
            if character in markers:
                # set `_has_marker` to True to indicate `text` has been handled
                _has_marker = True
                point = text.find(character)
                word = text[:point]
                splited.append(word)
                # check if text[point] is legal, to prevent raising of IndexError
                while point + 1 <= len(text) and text[point] in markers:
                    point = point + 1
                text = text[point:]
                # break the loop when you find a marker
                # and change `text` according to it,
                # so that the new loop will get started with changed `text`
                break
        # if no marker was found in `text`, add the whole `text` to `splited`
        if not _has_marker:
            splited.append(text)
        else:
            iter_text(text)

    iter_text(text)

    print 'final splited = ', splited

For other details, please see the comments in the code.
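
The recursive version makes one nested call per word, so a very long text could run into Python's recursion limit (1000 frames by default). A sketch of the same "restart the scan on the changed text" idea with a while loop instead; better_split_iterative is a hypothetical name, and unlike the code above it returns the list and does not add an empty word when the text starts with a marker:

```python
def better_split_iterative(text, markers):
    """Split text on any character in markers, without recursion."""
    words = []
    while text:
        # Positions of the first occurrence of each marker present in the
        # *current* text; the smallest one is the next delimiter.
        positions = [text.find(m) for m in markers if m in text]
        if not positions:
            # No marker left: the remainder is the last word.
            words.append(text)
            break
        point = min(positions)
        if point > 0:
            words.append(text[:point])
        # Skip the whole run of consecutive markers.
        while point < len(text) and text[point] in markers:
            point += 1
        text = text[point:]
    return words


print(better_split_iterative("After  the flood   ...  all the colors came out.", " ."))
# ['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']
```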

BTW, combining built-in functions may be simpler, although I also think that implementing an algorithm on your own is a good way to learn a language :)

def better_split(s, separators):
    assert isinstance(separators, str), 'separators must be a string'
    buf = [s]
    for sep in separators:
        # Build a new list on every pass instead of modifying buf while
        # iterating over it, which could skip the element after a removed one.
        buf = [i for text in buf for i in text.split(sep) if i]
    return buf

Thank you very much, @Reorx for your code and your tips. You do fix my code (although it now seems to be a bit more complex than before), providing a code that gives the expected outcome. Thank you very much for your time!!! — craftApprentice, Mar 17 '12 at 14:09
I know (and you also know) that this procedure is not the best way to do what I wanted, but this answer showed me how to do what I had in mind and it reused the code I had written. Moreover, this response has taught me new things, like writing an internal function and use a flag to accomplish what I had in mind. — craftApprentice, Mar 17 '12 at 14:18
@Pythonista'sApprentice Thanks very much for your appreciation, it makes me feel great to see my answers be useful to you :) There's still something I can tell you about your previous code: a str in Python can be treated like a list, `i in a_str` is just the same as `i in list(a_str)`, so you didn't need to turn _markersString_ into the list _markers_. Moreover, later in the day, I thought this function could be enhanced: you have now written a function that takes multiple *single* characters to split text; how about passing multiple separators that are strings with no length limit (currently 1)? — Reorx, Mar 17 '12 at 15:54
Thank you, @Reorx, for this explanation about i in a_str !! I learned a lot with you! Your question is very good. I'll think about it. I'm also with another doubt related to this question, if you can give me a hand: http://stackoverflow.com/questions/9752707/an-elegant-solution-to-prevent-that-an-algorithm-similar-to-the-split-but-wi Thanks again! — craftApprentice, Mar 17 '12 at 19:14

score 2 · Answer 4 · answered Mar 17 '12 at 05:30

better_split() is not a good name. How "better", in what way?

yourmodule.split() is enough to differentiate it from any other split() function.

You can implement it using re.split():

import re

def split(text, separators):
    re_sep = re.compile(r"(?:{0})+".format("|".join(map(re.escape, separators))))
    return filter(None, re_sep.split(text))

Example

>>> split("After  the flood   ...  all the colors came out.", " .")
['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']

If you're not allowed to use map and filter, then you could easily replace them with a generator expression and a list comprehension.

Instead of "|".join(map(re.escape, separators)):

"|".join(re.escape(s) for s in separators)

Instead of filter(None, re_sep.split(text)):
```
[s for s in re_sep.split(text) if s]
```
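
Assembled into one function, the two replacements might look like this sketch. The name split_no_map_filter is made up for the illustration; because the list comprehension builds a real list, the result is a list on Python 3 as well as on Python 2:

```python
import re


def split_no_map_filter(text, separators):
    """Split text on runs of any of the separator characters."""
    # Escape each separator so characters such as "." are matched literally.
    alternatives = "|".join(re.escape(s) for s in separators)
    # "+" makes a run of adjacent separators count as a single split point.
    re_sep = re.compile(r"(?:{0})+".format(alternatives))
    # A separator at the start or end of the text yields an empty string; drop it.
    return [s for s in re_sep.split(text) if s]


print(split_no_map_filter("After  the flood   ...  all the colors came out.", " ."))
# ['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']
```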

+1 For the shortest alternative solution within all the answers to this question. — Tadeck, Mar 17 '12 at 19:33

Ravikiran D · Answer 5 · 2017-08-19T14:37:29.977

0

def spli(text, sep=' '):
    # Collect characters into `word` until a separator character is reached.
    # (Names changed so that the built-ins str and list are not shadowed.)
    index = 0
    word = ''
    words = []
    while index < len(text):
        if text[index] not in sep:
            word += text[index]
        else:
            words.append(word)
            word = ''
        index += 1
    if word:
        words.append(word)
    return words

n = 'hello'
print(spli(n))

output:
 ['hello']
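
The loop above appends a word every time it meets a separator, so two adjacent separators produce an empty string in the result. A sketch of a variant that skips those empty words, which matches the output expected in the question; split_chars is a hypothetical name:

```python
def split_chars(text, seps=" "):
    """Split text on any character in seps, ignoring empty words."""
    words = []
    current = ""
    for character in text:
        if character in seps:
            # Only store the word if something was collected since the
            # previous separator.
            if current:
                words.append(current)
            current = ""
        else:
            current += character
    # Keep the last word when the text does not end with a separator.
    if current:
        words.append(current)
    return words


print(split_chars("After  the flood   ...  all the colors came out.", " ."))
# ['After', 'the', 'flood', 'all', 'the', 'colors', 'came', 'out']
```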

edited Aug 19 '17 at 14:37

answered Aug 19 '17 at 13:34


Elaborate your answer so that others know how and why your answer will be one of a solution to the problem. – UmarZaii Aug 19 '17 at 14:01
Its a source code of "split" . A built in function in string module – Ravikiran D Oct 21 '17 at 13:38
