33

My current Python project will require a lot of string splitting to process incoming packages. Since I will be running it on a pretty slow system, I was wondering what the most efficient way to go about this would be. The strings would be formatted something like this:

Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5

Explanation: This particular example would come from a list where the first two items are a title and a date, while item 3 to item 5 would be invited people (the number of those can be anything from zero to n, where n is the number of registered users on the server).

From what I see, I have the following options:

  1. repeatedly use split()
  2. Use a regular expression and regular expression functions
  3. Some other Python functions I have not thought about yet (there are probably some)

Solution 1 would include splitting at | and then splitting the last element of the resulting list at <> for this example, while solution 2 would probably result in a regular expression like:

((.+)|)+((.+)(<>)?)+

Okay, this regular expression is horrible, I can see that myself. It is also untested. But you get the idea.

Now, I am looking for the way that a) takes the least amount of time and b) ideally uses the least amount of memory. If only one of the two is possible, I would prefer less time. The ideal solution would also work for strings that have more items separated with | and strings that completely lack the <>. At least the regular expression-based solution would do that.

My understanding would be that split() would use more memory (since you basically get two resulting lists, one that splits at | and the second one that splits at <>), but I don't know enough about Python's implementation of regular expressions to judge how the regular expression would perform. split() is also less dynamic than a regular expression if it comes to different numbers of items and the absence of the second separator. Still, I can't shake the impression that Python can do this better without regular expressions, and that's why I am asking.

Some notes:

  • Yes, I could just benchmark both solutions, but I'm trying to learn something about Python in general and how it works here, and if I just benchmark these two, I still don't know what Python functions I have missed.
  • Yes, optimizing at this level is only really required for high-performance stuff, but as I said, I am trying to learn things about Python.
  • Addition: in the original question, I completely forgot to mention that I need to be able to distinguish the parts that were separated by | from the parts with the separator <>, so a simple flat list as generated by re.split(\||<>,input) (as proposed by obmarg) will not work too well. Solutions fitting this criterium are much appreciated.

To sum the question up: Which solution would be the most efficient one, for what reasons?

Due to multiple requests, I have run some timeit on the split()-solution and the first proposed regular expression by obmarg, as well as the solutions by mgibsonbr and duncan:

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split( "\||<>", input )


def mgibsonbr(input): # Solution by mgibsonbr
    items = re.split(r'\||<>', input) # Split input in items
    offset = 0
    result = [] # The result: strings for regular items, lists for <> separated ones
    acc = None
    for i in items:
        delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
        offset += len(i) + len(delimiter)
        if delimiter == '<>': # Will always put the item in a list
            if acc is None:
                acc = [i] # Create one if doesn't exist
                result.append(acc)
            else:
                acc.append(i)
        else:
            if acc is not None: # If there was a list, put the last item in it
                acc.append(i)
            else:
                result.append(i) # Add the regular items
            acc = None # Clear the list, since what will come next is a regular item or a new list
    return result

def split2(input): # Solution by duncan
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split2:", timeit.Timer("split2('a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "mgibs:", timeit.Timer("mgibsonbr('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import mgibsonbr").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "split:", timeit.Timer("split2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()

The results:

mgibs: 14.7349407408
split: 6.403942732
split2: 3.68306812233
regex: 5.28414318792
mgibs: 107.046683735
split: 46.0844590775
split2: 26.5595985591
regex: 28.6513302646

At the moment, it looks like split2 by duncan beats all other algorithms, regardless of length (with this limited dataset at least), and it also looks like mgibsonbr's solution has some performance issues (sorry about that, but thanks for the solution regardless).

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
malexmave
  • 1,283
  • 2
  • 17
  • 37
  • 4
    Go on, just use `timeit` and see for yourself. It's easy. My bets are on `str.split`. – wim Mar 07 '12 at 14:04
  • 1
    I think for long string, `re` may be faster. But you have to benchmark to be sure. – Kien Truong Mar 07 '12 at 14:11
  • I'd bet that this is IO bound. My first guess would be to say that any type of splitting will be faster than the input it receives from disk or network. – Steven Rumbalski Mar 07 '12 at 14:30
  • If there are "no invited people" does your string look like this `"Item 1 | Item 2"` or like this `"Item 1 | Item 2 |"`? – Steven Rumbalski Mar 07 '12 at 14:33
  • @StevenRumbalski The second option. – malexmave Mar 07 '12 at 14:36
  • `timeit` isn't "slowing down execution quite a bit". It just performs the operation repeatedly, by default a million times. You can pass an argument to `Timer.timeit()` to use a different number of runs. – Sven Marnach Mar 07 '12 at 15:06
  • @SvenMarnach Oh, right. That explains it. Thanks for the clarification. – malexmave Mar 07 '12 at 15:09
  • I just like to point out something I noticed about your input: what happens if you have zero <> delimited values? what about if you have one? Neither solution presented so far will work if your input is like: `a|b|c|de||a|b|c|de|f|a|b|c|de|f<>ge|a...` I'd suggest making it like `a|b|c|de|<>|a|b|c|de|f<>|a|b|c|de|f<>ge|a...` and then using Duncan's solution (maybe removing empty entries with a list comprehension, as I suggested earlier). – mgibsonbr Mar 07 '12 at 16:11
  • @mgibsonbr Thanks for pointing that out. I will see how exactly the string will look like in the end, but that is a vital consideration that you noticed there. Thanks! – malexmave Mar 07 '12 at 16:15

5 Answers5

25

I was slightly surprised that split() performed so badly in your code, so I looked at it a bit more closely and noticed that you're calling list.remove() in the inner loop. Also you're calling split() an extra time on each string. Get rid of those and a solution using split() beats the regex hands down on shorter strings and comes a pretty close second on the longer one.

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def split2(input):
    res0 = input.split("|")
    res1, res2 = [], []
    for r in res0:
        if "<>" in r:
            res2.append(r.split("<>"))
        else:
            res1.append(r)
    return res1, res2

def regexit(input):
    return re.split( "\||<>", input )

rSplitter = re.compile("\||<>")

def regexit2(input):
    return rSplitter.split(input)

print("split: ", timeit.Timer("splitit( 'a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2(  'a|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
print("regex: ", timeit.Timer("regexit( 'a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())
print("split: ", timeit.Timer("splitit( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit())
print("split2:", timeit.Timer("split2(  'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import split2").timeit())
print("regex: ", timeit.Timer("regexit( 'a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit())
print("regex2:", timeit.Timer("regexit2('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit2").timeit())

Which gives the following result:

split:  1.8427431439631619
split2: 1.0897291360306554
regex:  1.6694280610536225
regex2: 1.2277749050408602
split:  14.356198082969058
split2: 8.009285948995966
regex:  9.526430513011292
regex2: 9.083608677960001

And of course split2() gives the nested lists that you wanted whereas the regex solution doesn't.

Compiling the regex will improve performance. It does make a slight difference, but Python caches compiled regular expressions so the saving is not as much as you might expect. I think usually it isn't worth doing it for speed (though it can be in some cases), but it is often worthwhile to make the code clearer.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Duncan
  • 92,073
  • 11
  • 122
  • 156
  • Good catch. Did not think about that at all when implementing the test split-function. Now that you pointed it out, it makes a lot of sense, since removing from lists is "expensive", not to mention splitting everything regardless of content (I was looking for a java-like `contains`, but it seems python does this differently). Thanks a lot. – malexmave Mar 07 '12 at 15:42
  • You can do even better if you "compile" the regular expression, namely add a line like `rSplitter = re.compile("\||<>")` The define the splitting code: `def regexit2(input): return rSplitter.split(input)` YYMV, but for me, I saw this compiled version as noticeably faster than the original regex version, and comfortably fastest in all tests. – F1Rumors Sep 24 '18 at 18:15
  • 1
    @F1Rumors, thanks. I've updated the example code though the difference isn't massive as Python will be caching the compiled regex. – Duncan Sep 25 '18 at 08:50
  • `Split` is most of the time faster than a `regex`, but it depends on the complexity of the regex. Some new users are [linking to this answer](https://stackoverflow.com/a/65081323/797495) stating that _"re is faster than split."_, which is not entirely true. – Pedro Lobito Nov 30 '20 at 22:00
11

I'm not sure if it's the most efficient, but certainly the easiest to code seems to be something like this:

>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"
>>> re.split( "\||<>", input )
>>> ['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']

I would think there's a fair chance of it being more efficient than a plain old split as well (depending on the input data) since you'd need to perform the second split operation on every string output from the first split, which doesn't seem likely to be efficient for either memory or time.

Though having said that I could easily be wrong, and the only way to be sure would be to time it.

obmarg
  • 9,369
  • 36
  • 59
  • Thanks for your algorithm. I forgot to mention that I need the second part (Item 3-5) as a nested list to distinguish it from the other parts and to make processing easier. But since I forgot it, I can't really hold it against you that your solution does not do it ;-) – malexmave Mar 07 '12 at 14:44
  • 1
    @malexmave: Once you have a list it's easy. `alist = re.split( "\||<>", input); result = [alist[0], alist[1], alist[2:]]`. Voila. Nested. – Steven Rumbalski Mar 07 '12 at 15:24
  • Thanks, that will definitly work if I have fixed positions inside the list. What about the problem I described in a [comment further down](http://stackoverflow.com/questions/9602856/most-efficient-way-to-split-strings-in-python/9603035#comment12182897_9603443)? Do you have an easy solution to that? Or at least one that is easier than the one proposed by @mgibsonbr? Thanks for your effort! – malexmave Mar 07 '12 at 15:31
3

Calling split multiple times is likely to be inneficient, because it might create unneeded intermediary strings. Using a regex like you proposed won't work, since the capturing group will only get the last item, not every of them. Splitting using a regex, like obmarg suggested, seems to be the best route, assuming a "flattened" list is what you're looking for.

If you don't want a flattened list, you can split using a regex first and then iterate over the results, checking the original input to see which delimiter was used:

items = re.split(r'\||<>', input)
offset = 0
for i in items:
    delimiter = '|' if input[offset+len(i)] == '|' else '<>'
    offset += len(i) + len(delimiter)
    # Do something with i, depending on whether | or <> was the delimiter

At last, if you don't want the substrings created at all (using only the start and end indices to save space, for instance), re.finditer might do the job. Iterate over the delimiters, and do something to the text between them depending on which delimiter (| or <>) was found. It's a more complex operation, since you'll have to handle many corner cases, but might be worth it depending on your needs.

Update: for your particular case, where the input format is uniform, obmarg's solutions is the best one. If you must, post-process the result to have a nested list:

split_result = re.split( "\||<>", input )
result = [split_result[0], split_result[1], [i for i in split_result[2:] if i]]

(that last list comprehension is to ensure you'll get [] instead of [''] if there are no items after the last |)

Update 2: After reading the updated question, I finally understood what you're trying to achieve. Here's the full example, using the framework suggested earlier:

items = re.split(r'\||<>', input) # Split input in items
offset = 0
result = [] # The result: strings for regular itens, lists for <> separated ones
acc = None
for i in items:
    delimiter = '|' if offset+len(i) < len(input) and input[offset+len(i)] == '|' else '<>'
    offset += len(i) + len(delimiter)
    if delimiter == '<>': # Will always put the item in a list
        if acc is None:
            acc = [i] # Create one if doesn't exist
            result.append(acc)
        else:
            acc.append(i)
    else:
        if acc is not None: # If there was a list, put the last item in it
            acc.append(i)
        else:
            result.append(i) # Add the regular items
        acc = None # Clear the list, since what will come next is a regular item or a new list
print result

Tested it with your example, the result was:

['a', 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c','de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'aha'],
 'b', 'c', 'de', ['f', 'ge', 'ah']]
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
mgibsonbr
  • 21,755
  • 7
  • 70
  • 112
  • Thanks for your input. I am currently trying to understand what your `for`Block does. As I understand it, it calculates the offset at which the `|`-seperated block ends and the `<>`-seperated block begins, right? – malexmave Mar 07 '12 at 14:47
  • Not really... at each iteration, `offset` is where the next item is. `offset+len(i)` is where the next delimiter (or the end of string) is. I used it so you can quickly find the delimiters without having to scan the input string again. However, if your strings will have a regular format, you don't really need it, see my updated answer. – mgibsonbr Mar 07 '12 at 14:58
  • Thanks for the update. I guess there is no similarily simple way to get the same result with a variable number of items seperated by `|` before the `<>` begin, correct? Not to mention a format like `item 1 | item 2 <> Item 3 | Item 4`, which could of course be avoided by changing the formatting function on the client side. – malexmave Mar 07 '12 at 15:08
  • See my updated answer. The suggested solution with a loop was addressing exactly that, the need to have some nesting depending on the delimiter type. – mgibsonbr Mar 07 '12 at 15:23
  • FYI: It looks like your new algorithm has some performance issues (See OP for benchmark). – malexmave Mar 07 '12 at 15:52
  • You're right. I guess you can't be sure of the performance without benchmarking... I could try to improve my answer, but Duncan's one is clearly better, so I won't bother ;) – mgibsonbr Mar 07 '12 at 16:03
2

If you know that <> is not going to appear elsewhere in the string then you could replace '<>' with '|' followed by a single split:

>>> input = "Item 1 | Item 2 | Item 3 <> Item 4 <> Item 5"
>>> input.replace("<>", "|").split("|")
['Item 1 ', ' Item 2 ', ' Item 3 ', ' Item 4 ', ' Item 5']

This will almost certainly be faster than doing multiple splits. It may or may not be faster than using re.split - timeit is your friend.

On my system with the sample string you supplied, my version is more than three times faster than re.split:

>>> timeit input.replace("<>", "|").split("|")
1000000 loops, best of 3: 980 ns per loop
>>> import re
>>> timeit re.split(r"\||<>", input)
100000 loops, best of 3: 3.07 us per loop

(N.B.: This is using IPython, which has timeit as a built-in command).

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Dave Kirby
  • 25,806
  • 5
  • 67
  • 84
  • Thanks for the idea. But this will create a flat list with no way to find out which one of the resulting substrings was seperated by `<>` instead of `|`, which is a requirement for my code, as I edited into the OP. Still, it's a good idea if you don't require nested lists. – malexmave Mar 07 '12 at 15:28
1

You can make use of replace. First replace <> with |, and then split by |.

def replace_way(input):
    return input.replace('<>','|').split('|')

Time performance:

import timeit
import re

def splitit(input):
    res0 = input.split("|")
    res = []
    for element in res0:
        t = element.split("<>")
        if t != [element]:
            res0.remove(element)
            res.append(t)
    return (res0, res)

def regexit(input):
    return re.split( "\||<>", input )
    
def replace_way(input):
    return input.replace('<>','|').split('|')


print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "replace:",timeit.Timer("replace_way('a|b|c|de|f<>ge<>ah')","from __main__ import replace_way").timeit()
print "split:", timeit.Timer("splitit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import splitit").timeit()
print "regex:", timeit.Timer("regexit('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import regexit").timeit()
print "replace:",timeit.Timer("replace_way('a|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>aha|b|c|de|f<>ge<>ah')","from __main__ import replace_way").timeit()

Results on my machine:

split: 11.8682055461
regex: 12.7430856814
replace: 2.54299265006
split: 79.2124379066
regex: 68.6917008003
replace: 10.944842347
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
Akavall
  • 82,592
  • 51
  • 207
  • 251
  • Thank you for your suggestion. However, it has the problem that it does not return nested lists for the "inner" items (the ones seperated by `<>` instead of `|`). – malexmave Mar 07 '12 at 15:40