3

Suppose I have a string such as "Let's split this string into many small ones" and I want to split it at the words "this", "into" and "ones",

such that the output looks something like this:

["Let's split", "this string", "into many small", "ones"]

What is the most efficient way to do it?

Martijn Pieters
cHaTrU
  • possible duplicate of [Python strings split with multiple separators](http://stackoverflow.com/questions/1059559/python-strings-split-with-multiple-separators) – Kugel Dec 18 '12 at 14:55
  • no this deals with separating at particular sequence of characters – cHaTrU Dec 18 '12 at 14:59
  • I don't see the difference. Python does not have character type only strings. – Kugel Dec 18 '12 at 15:01
  • The only similarity that I see is that both questions can be solved using regular expressions; apart from that, both questions are quite specific and hence different. However, if you pointed to some question giving a general explanation of regular expressions, that could be said to be similar only in the sense that it gives a general overview of the field, nothing more. – cHaTrU Dec 18 '12 at 15:07
  • You ask to split a string having multiple separators. The same as the other question. The only difference is you provided different string example. – Kugel Dec 18 '12 at 15:18
  • so can you split this string with the answer for that example? – cHaTrU Dec 18 '12 at 15:25
  • This is different than the other. Specifically, my answer allows OP to keep the separators in the string. The answer by Ignacio requires a somewhat clever use of non-capturing groups/lookahead to split on a single separator (whitespace) but only under special conditions. Either way, the original question is a bit more restricted than the supposed duplicate and requires a more finely tuned answer. – mgilson Dec 18 '12 at 15:31

3 Answers

11

With a lookahead.

>>> import re
>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']
Ignacio Vazquez-Abrams
3

By using re.split():

>>> import re
>>> re.split(r'(this|into|ones)', "Let's split this string into many small ones")
["Let's split ", 'this', ' string ', 'into', ' many small ', 'ones', '']

By putting the words to split on in a capturing group, the output includes the words we split on.
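
To make the effect of the capturing group concrete, here is a small sketch that runs the same split with a capturing and a non-capturing group. The variable names are only illustrative.

```python
import re

text = "Let's split this string into many small ones"

# Capturing group: the matched separators are returned between the pieces.
with_separators = re.split(r'(this|into|ones)', text)

# Non-capturing group: the separators are consumed and dropped.
without_separators = re.split(r'(?:this|into|ones)', text)

print(with_separators)
print(without_separators)
```

In both cases the surrounding spaces stay attached to the pieces, and a trailing empty string appears because the text ends with a separator.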

If you need the spaces removed, use map(str.strip, result) on the re.split() output (on Python 3, wrap it in list(), because map() returns an iterator there):

>>> list(map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones")))
["Let's split", 'this', 'string', 'into', 'many small', 'ones', '']

and you could use filter(None, result) to remove any empty strings if need be:

>>> list(filter(None, map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones"))))
["Let's split", 'this', 'string', 'into', 'many small', 'ones']

To split on words but keep them attached to the following group, you need to use a lookahead assertion instead:

>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']

Now we are really splitting on whitespace, but only on whitespace that is followed by a whole word, one in the set of this, into and ones.
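
As a rough illustration of why the \b matters, the sketch below compares the pattern with and without it on a made-up sentence that contains "thistle" and "onesies". The sample text and variable names are assumptions chosen for the demonstration.

```python
import re

# Made-up sample: "thistle" and "onesies" merely start with a split word.
text = "split thistle into onesies and ones"

# With \b, the lookahead only succeeds when the whole word matches.
whole_words = re.split(r'\s(?=(?:this|into|ones)\b)', text)

# Without \b, any word that starts with a split word triggers a split.
prefixes = re.split(r'\s(?=(?:this|into|ones))', text)

print(whole_words)
print(prefixes)
```

Because the lookahead consumes nothing, only the single whitespace character is removed at each split point.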

Martijn Pieters
  • thanks Martijn, but this splits the words apart as well. what I need is that the split happens at the location of those words. – cHaTrU Dec 18 '12 at 15:01
1

Here's a fairly lazy way to do it:

import re

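def resplit(regex, s):
    # Yield slices of s, cutting at the start of each regex match.
    current = None
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Slice from the last cut; using current also covers an s with no match.
    yield s[current:]

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
print(list(resplit(regex, s)))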

I don't know for sure if this is the most efficient, but it's pretty clean.

Basically, we just iterate through the matches taking 1 piece at a time. The pieces are determined by the index in the string (s) where the regex starts to match. We just chop the string up until that point and we save that index as the start point of the next slice.
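
The same idea can be sketched without a generator by first collecting the cut points and then slicing between neighbouring ones. This is only an illustrative variant with made-up names (starts, bounds, pieces), not a replacement for the function above.

```python
import re

s = "Let's split this string into many small ones"
regex = re.compile(r'(this|into|ones)')

# Index where each match begins; these are the cut points.
starts = [m.start() for m in regex.finditer(s)]

# None as a slice bound means "from the beginning" or "to the end".
bounds = [None] + starts + [None]

# Slice between each pair of neighbouring bounds.
pieces = [s[a:b] for a, b in zip(bounds, bounds[1:])]

print(starts)
print(pieces)
```

Since the cut is made at the start of each match, the whitespace before a matched word stays at the end of the preceding piece.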


As for performance, Ignacio clearly wins this round (times in seconds for timeit's default of 1,000,000 calls):

9.1412050724  -- Me
3.09771895409  -- ignacio

Code:

import re

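def resplit(regex, s):
    # Yield slices of s, cutting at the start of each regex match.
    current = None
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Slice from the last cut; using current also covers an s with no match.
    yield s[current:]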


def me(regex,s):
    return list(resplit(regex,s))

def ignacio(regex,s):
    # Split the string that was passed in, so both functions time the same input.
    return regex.split(s)

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
regex2 = re.compile(r'\s(?=(?:this|into|ones)\b)')

import timeit
print(timeit.timeit("me(regex,s)","from __main__ import me,regex,s"))
print(timeit.timeit("ignacio(regex2,s)","from __main__ import ignacio,regex2,s"))
mgilson
  • thanks mgilson, it works. Can you explain just the big picture? – cHaTrU Dec 18 '12 at 15:05
  • @cHaTrU -- This does leave some spaces on the strings. I don't know if that's an issue. This is the main difference between my answer and the answer by Ignacio Vazquez-Abrams. Both are valid for your problem, but his is probably better. Mine would be better if you wanted the spaces left on the string. (or wanted to split words in the middle) – mgilson Dec 18 '12 at 15:09
  • @downvoter -- Please let me know how I can improve this answer. I'm happy to make it better so that it can be useful for generations to come :) – mgilson Dec 18 '12 at 15:12
  • i upvoted it man! for me it was pretty simple to understand solution – cHaTrU Dec 18 '12 at 15:14
  • @cHaTrU -- I'm actually timing it now (compared to the other solution). I'll post the results -- even if they aren't in my favor :) – mgilson Dec 18 '12 at 15:17
  • @cHaTrU -- timing's posted. Ignacio wins, hands down :). I still assert that my answer could possibly fill a (very small) niche that his doesn't though, so I'll not delete it. – mgilson Dec 18 '12 at 15:21
  • I think it's a pretty simple solution to understand. By the way, any idea about the general performance of regular expressions in terms of timing? – cHaTrU Dec 18 '12 at 15:23
  • The performance of an `re` is highly dependent on the regex. Something simple like this should be OK. The benefit of using `re.split` for the whole thing is that it can be optimized in C code whereas my code needs to have the overhead of 2 python generators + a list build and some additional loop overhead as well (which is pretty much free if you can optimize it in C). – mgilson Dec 18 '12 at 15:26
  • @mgilson can I provide a list of delimiters instead of (this|into|ones)? – cHaTrU Dec 18 '12 at 15:39
  • @cHaTrU -- Sure: `'('+'|'.join(iterable_delimiters) + ')'`, or if they might have regex special characters: `"({0})".format("|".join( "(?:{0})".format(re.escape(x)) for x in delimiters ))` -- I might have gone overboard in my non-capturing groups there ... I'm not a regex master ;-) – mgilson Dec 18 '12 at 15:40
  • thanks. Can you please tell me about some resources where I can learn some regex skills? – cHaTrU Dec 18 '12 at 16:02
  • @cHaTrU -- Everything I've learned came from the python docs on the `re` module and from reading SO posts + an occasional google search. I'm sure there are better ways to do it, but I don't know them. – mgilson Dec 18 '12 at 16:03
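
Following up on the comment about passing a list of delimiters, here is a minimal sketch that builds the lookahead pattern from a list. build_split_pattern is a hypothetical helper name, and the sketch assumes the delimiters are plain words (the trailing \b may not behave as intended for delimiters that end in punctuation).

```python
import re

def build_split_pattern(delimiters):
    # Hypothetical helper: escape each delimiter so that any regex
    # metacharacters in it are matched literally, then join with "|".
    alternatives = "|".join(re.escape(d) for d in delimiters)
    # Split on whitespace that is followed by one of the whole words.
    return re.compile(r"\s(?=(?:{0})\b)".format(alternatives))

pattern = build_split_pattern(["this", "into", "ones"])
pieces = pattern.split("Let's split this string into many small ones")
print(pieces)
```

Compiling the pattern once and reusing it avoids rebuilding the alternation for every string you split.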