
In my quest for optimization, I discovered that the built-in split() method is about 40% faster than the re.split() equivalent.

A dummy benchmark (easily copy-pasteable):

import re, time, random

def random_string(_len):
    # build a random string of length _len from the letters A, B and C
    letters = "ABC"
    return "".join(letters[random.randint(0, len(letters) - 1)] for i in range(_len))

r = random_string(2000000)
pattern = re.compile(r"A")

start = time.time()
pattern.split(r)
print "with re.split : ", time.time() - start

start = time.time()
r.split("A")
print "with built-in split : ", time.time() - start

Why this difference?

hymloth
  • Why not? Please don't say "curiosity". What problem do you have that's solved by asking us to read the implementation of `re` and `str` and comment on the differences? Perhaps you could read the implementations, comment on the differences, and ask **specific** questions. – S.Lott Sep 21 '11 at 14:36
  • I actually expect more than a 40% speed increase. Simple is faster. – utdemir Sep 21 '11 at 14:38
  • I thought it was obvious (?) that split() uses some kind of regular expressions, but it does not... – hymloth Sep 21 '11 at 14:45
  • @hymloth You are confusing Python with Java then (which is the only language I know that uses regex in `String.split()`) – NullUserException Sep 21 '11 at 14:53
  • @NullUserException: Perl's `split()` function also uses regular expressions. – Sven Marnach Sep 21 '11 at 16:00
  • @hymloth _time.clock()_ On Unix, return the current processor time as a floating point number expressed in seconds. The precision, and in fact the very definition of the meaning of “processor time”, depends on that of the C function of the same name, but **in any case, this is the function to use for benchmarking Python or timing algorithms.** (http://docs.python.org/library/time.html#time.clock) – eyquem Sep 21 '11 at 16:00
  • @Sven Is there anything in Perl that *doesn't* use regexes? – NullUserException Sep 21 '11 at 16:07
  • @eyquem: The module to be used for benchmarking Python is `timeit`. Using `time.clock()` is better than `time.time()`, but still has problems that are solved in the `timeit` module. (One example is that `timeit` turns the garbage collector off while timing.) – Sven Marnach Sep 21 '11 at 16:25
  • @Sven Marnach I agree with you: _timeit_ has more advanced, interesting features than *clock*. But I don't agree with the fact that it **IS** the module to use: the description of *clock* expresses that this function is usable for benchmarks and timings, without even saying that it is preferable to use *timeit*; and for simple measures of times concerning executions having a neat difference, *clock* is plenty sufficient. Before I understood that, I was wearing myself out using *timeit*, until I benchmarked *timeit* and *clock* and observed that *timeit* wasn't useful for my limited needs. – eyquem Sep 21 '11 at 17:09
  • @hymloth Use **"".join(random.choice(letters) for i in xrange(_len))** instead (`xrange` on Python 2.x, `range` on Python 3) – eyquem Sep 21 '11 at 17:19
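As a side note to the timeit discussion above, here is a minimal sketch (not from the original post; Python 3 syntax and a smaller test string assumed so it runs quickly) of how the same comparison looks with the `timeit` module, which repeats the measurement and disables the garbage collector while timing:

```python
import re
import timeit

# a test string in the same spirit as the question's benchmark
s = "ABC" * 100000
pattern = re.compile("A")

# timeit runs each callable `number` times with the GC switched off
t_re = timeit.timeit(lambda: pattern.split(s), number=20)
t_str = timeit.timeit(lambda: s.split("A"), number=20)

print("re.split  :", t_re)
print("str.split :", t_str)
```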

3 Answers


re.split is expected to be slower, since invoking the regular expression engine incurs overhead that a plain string scan avoids.

Of course if you are splitting on a constant string, there is no point in using re.split().
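To make the point concrete, a quick sketch (Python 3 syntax, not part of the original answer): when the delimiter is a plain string with no metacharacters, both calls produce exactly the same list, so the regex machinery adds cost without adding value:

```python
import re

s = "one,two,three"

# identical results when splitting on a constant delimiter
assert re.split(",", s) == s.split(",") == ["one", "two", "three"]

# the regex version only pays off once you need pattern power,
# e.g. splitting on several different delimiters at once
print(re.split("[,;]", "one,two;three"))  # ['one', 'two', 'three']
```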

user1767754
NullUserException
  • @duhaime Because regular expressions are designed for the case where a simple constant string match is not enough. If you do not need that extra power, then it makes sense to use the regular built-in split() – Alex Spurling Jan 27 '19 at 16:03
  • Ah, interesting, I didn't know that! Thanks for following up @AlexSpurling – duhaime Jan 28 '19 at 11:35

When in doubt, check the source code. You can see that Python's s.split() is optimized for whitespace and inlined, but it only handles fixed delimiters.

In exchange for the speed penalty, a regular-expression-based re.split is far more flexible:

>>> re.split(r':+', "One:two::t h r e e:::fourth field")
['One', 'two', 't h r e e', 'fourth field']
>>> "One:two::t h r e e:::fourth field".split(':')
['One', 'two', '', 't h r e e', '', '', 'fourth field']
# would require an additional step to weed out the empty fields...
>>> re.split(r'[:\d]+', "One:two:2:t h r e e:3::fourth field")
['One', 'two', 't h r e e', 'fourth field']
# try that without a regex split in an understandable way...
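That "additional step" is simple enough to sketch (Python 3 syntax assumed; not from the original answer): filter out the empty fields that the plain split leaves behind when delimiters repeat:

```python
s = "One:two::t h r e e:::fourth field"

# plain split leaves empty strings wherever the delimiter repeats;
# a comprehension can weed them out after the fact
fields = [field for field in s.split(":") if field]
print(fields)  # ['One', 'two', 't h r e e', 'fourth field']
```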

That re.split() takes only about 40% longer (or, equivalently, that s.split() takes only about 29% less time) is what should be surprising.

the wolf
4

Running a regular expression means driving a state machine for every character of the input. Splitting on a constant string means simply searching for that substring. The latter is a much less complicated procedure.
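One way to glimpse that state machine (a sketch, Python 3 syntax, not part of the original answer): compiling with `re.DEBUG` dumps the little program the regex engine will interpret, whereas `str.split` goes straight to a C-level substring scan:

```python
import re

# even the trivial pattern "A" is compiled into an engine instruction
# (LITERAL 65) that the matcher interprets position by position;
# re.DEBUG prints that compiled program to stdout
pat = re.compile("A", re.DEBUG)

# str.split needs no such program: it scans for the substring directly
print("AxAxA".split("A"))  # ['', 'x', 'x', '']
```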

Veltzer Doron
Dov Grobgeld