
In my quest for optimization, I discovered that the built-in split() method is about 40% faster than the re.split() equivalent.

A dummy benchmark (easily copy-pasteable):

import re, time, random

def random_string(_len):
    # build a random string of length _len from the letters A, B and C
    letters = "ABC"
    return "".join(letters[random.randint(0, len(letters) - 1)] for i in range(_len))

r = random_string(2000000)
pattern = re.compile(r"A")

start = time.time()
pattern.split(r)
print "with re.split : ", time.time() - start

start = time.time()
r.split("A")
print "with built-in split : ", time.time() - start

Why this difference?

hymloth
  • Why not? Please don't say "curiosity". What problem do you have that's solved by asking us to read the implementation of `re` and `str` and comment on the differences? Perhaps you could read the implementations, comment on the differences, and ask **specific** questions. – S.Lott Sep 21 '11 at 14:36
  • I actually expect more than a 40% speed increase. Simple is faster. – utdemir Sep 21 '11 at 14:38
  • I thought it was obvious (?) that split() uses some kind of regular expressions, but it does not... – hymloth Sep 21 '11 at 14:45
  • @hymloth You are confusing Python with Java then (which is the only language I know that uses regex in `String.split()`) – NullUserException Sep 21 '11 at 14:53
  • @NullUserException: Perl's `split()` function also uses regular expressions. – Sven Marnach Sep 21 '11 at 16:00
  • @hymloth _time.clock()_ On Unix, return the current processor time as a floating point number expressed in seconds. The precision, and in fact the very definition of the meaning of “processor time”, depends on that of the C function of the same name, but **in any case, this is the function to use for benchmarking Python or timing algorithms.** (http://docs.python.org/library/time.html#time.clock) – eyquem Sep 21 '11 at 16:00
  • @Sven Is there anything in Perl that *doesn't* use regexes? – NullUserException Sep 21 '11 at 16:07
  • @eyquem: The module to be used for benchmarking Python is `timeit`. Using `time.clock()` is better than `time.time()`, but still has problems that are solved in the `timeit` module. (One example is that `timeit` turns the garbage collector off while timing.) – Sven Marnach Sep 21 '11 at 16:25
  • @Sven Marnach I agree with you: _timeit_ has more advanced, interesting features than *clock*. But I don't agree with the fact that it **IS** the module to use: the description of *clock* expresses that this function is usable for benchmarks and timings, without even saying that it is preferable to use *timeit*; and for simple measures of times concerning executions having a neat difference, *clock* is plenty sufficient. Before I understood that, I was wearing myself out using *timeit*, until I benchmarked *timeit* and *clock* and observed that *timeit* wasn't useful for my limited needs. – eyquem Sep 21 '11 at 17:09
  • @hymloth Use **"".join(random.choice(letters) for i in xrange(_len))** instead (`xrange` on Python 2.x, `range` on Python 3) – eyquem Sep 21 '11 at 17:19
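As a side note to the timeit discussion above, here is a minimal sketch (not from the original post; Python 3 syntax and a smaller test string assumed so it runs quickly) of how the same comparison looks with the `timeit` module, which repeats the measurement and disables the garbage collector while timing:

```python
import re
import timeit

# a test string in the same spirit as the question's benchmark
s = "ABC" * 100000
pattern = re.compile("A")

# timeit runs each callable `number` times with the GC switched off
t_re = timeit.timeit(lambda: pattern.split(s), number=20)
t_str = timeit.timeit(lambda: s.split("A"), number=20)

print("re.split  :", t_re)
print("str.split :", t_str)
```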

3 Answers


re.split is expected to be slower, since invoking the regular expression engine incurs overhead that a plain string scan avoids.

Of course if you are splitting on a constant string, there is no point in using re.split().
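To make the point concrete, a quick sketch (Python 3 syntax, not part of the original answer): when the delimiter is a plain string with no metacharacters, both calls produce exactly the same list, so the regex machinery adds cost without adding value:

```python
import re

s = "one,two,three"

# identical results when splitting on a constant delimiter
assert re.split(",", s) == s.split(",") == ["one", "two", "three"]

# the regex version only pays off once you need pattern power,
# e.g. splitting on several different delimiters at once
print(re.split("[,;]", "one,two;three"))  # ['one', 'two', 'three']
```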

user1767754
NullUserException
  • @duhaime Because regular expressions are designed for the case where a simple constant string match is not enough. If you do not need that extra power, then it makes sense to use the regular built-in split() – Alex Spurling Jan 27 '19 at 16:03
  • Ah, interesting, I didn't know that! Thanks for following up @AlexSpurling – duhaime Jan 28 '19 at 11:35

When in doubt, check the source code. You can see that Python's s.split() is optimized for whitespace and inlined, but it only handles fixed delimiters.

In exchange for the speed penalty, a regular-expression-based re.split is far more flexible:

>>> re.split(r':+', "One:two::t h r e e:::fourth field")
['One', 'two', 't h r e e', 'fourth field']
>>> "One:two::t h r e e:::fourth field".split(':')
['One', 'two', '', 't h r e e', '', '', 'fourth field']
# would require an additional step to weed out the empty fields...
>>> re.split(r'[:\d]+', "One:two:2:t h r e e:3::fourth field")
['One', 'two', 't h r e e', 'fourth field']
# try that without a regex split in an understandable way...
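That "additional step" is simple enough to sketch (Python 3 syntax assumed; not from the original answer): filter out the empty fields that the plain split leaves behind when delimiters repeat:

```python
s = "One:two::t h r e e:::fourth field"

# plain split leaves empty strings wherever the delimiter repeats;
# a comprehension can weed them out after the fact
fields = [field for field in s.split(":") if field]
print(fields)  # ['One', 'two', 't h r e e', 'fourth field']
```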

That re.split() takes only about 40% longer (or, equivalently, that s.split() takes only about 29% less time) is what should be surprising.

the wolf
4

Running a regular expression means driving a state machine for every character of the input. Splitting on a constant string means simply searching for that substring. The latter is a much less complicated procedure.
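One way to glimpse that state machine (a sketch, Python 3 syntax, not part of the original answer): compiling with `re.DEBUG` dumps the little program the regex engine will interpret, whereas `str.split` goes straight to a C-level substring scan:

```python
import re

# even the trivial pattern "A" is compiled into an engine instruction
# (LITERAL 65) that the matcher interprets position by position;
# re.DEBUG prints that compiled program to stdout
pat = re.compile("A", re.DEBUG)

# str.split needs no such program: it scans for the substring directly
print("AxAxA".split("A"))  # ['', 'x', 'x', '']
```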

Veltzer Doron
Dov Grobgeld