2

I am going through almost 120 Billion string combinations. I am trying to find the most speed optimized way of determining if the string in question has 3 (or more) sequential duplicate characters.

Ex:

string = "blah"

The test should return false.

string = "blaaah"

This would return true.

I successfully implemented a basic for loop that looped through each string's characters and compared the next character for a match. This worked, but for the quantity of strings I am filtering through I really would like to optimize it.

Any suggestions? Thanks!

w00tw00t111
  • 359
  • 5
  • 22
  • In some cases a basic for loop is actually the fastest. You should do some timings against your own code before you choose anything else – jamylak Jan 18 '15 at 06:54
  • did a quick calculation, for 5000 characters per second (better than what my machine gets), it would take you over 7 years, not counting file I/O, calculating 24/7 if you were to do this single threaded, assuming your average string has 10 characters. – JDong Jan 19 '15 at 21:19
  • @JDong yeah totally right. Multithreads across several instances. Huge dataset, spinning up multi ec2 instances seems like the only logical way to solve this. – w00tw00t111 Jan 20 '15 at 23:21
  • I would also suggest trying C. It's generally much faster. – JDong Jan 21 '15 at 22:08

3 Answers3

5

Through re module.

>>> def consecutive(string):
        if re.search(r'(.)\1\1', string):
            print('True')
        else:
            print('False')


>>> consecutive('blah')
False
>>> consecutive('blaah')
False
>>> consecutive('blaaah')
True
>>> consecutive('blaaaah')
True

() called capturing group which is used to capture characters which are matched by the pattern present inside that group. \1 back-references the characters present inside the capturing group.In the string blaaah, (.) captures the first a and checks for the immediate two occurrences of a. So aaa got matched.

Avinash Raj
  • 172,303
  • 28
  • 230
  • 274
4

You could instead use itertools.groupby() here. You'll still have to scan through the string, but so would a regex:

from itertools import groupby

three_or_more = (char for char, group in groupby(input_string)
                 if sum(1 for _ in group) >= 3)

This produces a generator; iterate over it to list all characters that are found 3 or more times, or use any() to see if there is at least one such group:

if any(three_or_more):
    # found at least one group of consecutive characters that
    # consists of 3 or more.

Unfortunately for me, the re solution is more efficient here:

>>> from timeit import timeit
>>> import random
>>> from itertools import groupby
>>> import re
>>> import string
>>> def consecutive_groupby(string):
...     three_or_more = (char for char, group in groupby(string)
...                      if sum(1 for _ in group) >= 3)
...     return any(three_or_more)
... 
>>> def consecutive_re(string):
...     return re.search(r'(.)\1\1', string) is not None
... 
>>> # worst-case: random data with no consecutive strings
...
>>> test_string = ''.join([random.choice(string.ascii_letters) for _ in range(1000)])
>>> consecutive_re(test_string), consecutive_groupby(test_string)
(False, False)
>>> timeit('consecutive(s)', 'from __main__ import test_string as s, consecutive_re as consecutive', number=10000)
0.19730806350708008
>>> timeit('consecutive(s)', 'from __main__ import test_string as s, consecutive_groupby as consecutive', number=10000)
4.633949041366577
>>> # insert repeated characters
...
>>> test_string_with_repeat = test_string[:100] + 'aaa' + test_string[100:]
>>> consecutive_re(test_string_with_repeat), consecutive_groupby(test_string_with_repeat)
(True, True)
>>> timeit('consecutive(s)', 'from __main__ import test_string_with_repeat as s, consecutive_re as consecutive', number=10000)
0.03344106674194336
>>> timeit('consecutive(s)', 'from __main__ import test_string_with_repeat as s, consecutive_groupby as consecutive', number=10000)
0.4827418327331543

The regular expression approach given by Avinash is the clear winner here, which goes to show you should always measure alternatives.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • Oh man that's really great: `It generates a break or new group every time the value of the key function changes` (from the docs). I don't think it gets more efficient than this. – JDong Jan 18 '15 at 04:44
  • Why use this instead of a regular expression? – Håken Lid Jan 18 '15 at 08:34
  • 2
    @HåkenLid: I was hoping it would be the faster option. Measuring with `timeit` shows I was wrong, however. You should use the regular expression. – Martijn Pieters Jan 18 '15 at 12:13
  • I would have to guess the `sum(1 _ for _ in group)` slowed it down? Perhaps the regex compiles to something better internally. Great benchmarking. – JDong Jan 19 '15 at 20:42
  • @JDong: yes, creating a new stack frame for almost every single character doesn't help. – Martijn Pieters Jan 19 '15 at 20:45
  • @MartijnPieters Awesome answer - thank you for benchmarking. Original solution I used was groupby() too. Thinking it was a core function it would be fast but even the for loop was better optimized. Super surprised the re compiled faster. Awesome benchmarking! – w00tw00t111 Jan 20 '15 at 23:19
1

You could define a capture group pattern, then search for it repeated:

import re

s = 'blaaah'
p = '(?P<g>.)(?P=g){2}'

m = re.search(p, s, re.M)
print(m).group(0)

Result:

aaa
l'L'l
  • 44,951
  • 10
  • 95
  • 146
  • 1
    may i know the differences between your's and mine? – Avinash Raj Jan 18 '15 at 04:40
  • @AvinashRaj, They are basically the same other than here I've defined a group name `` for the pattern that checks for itself `{2}` additional times (which yours does using back references). I've also indicated that the search return any results in the form of a string, instead of true or false. :) – l'L'l Jan 18 '15 at 04:45
  • `(?P=g){2}` = `\1` . I used the group number. But you used the group name. Both are exactly the same thing. They does the same job. A single regex would be written in any number of forms , `re.search(r'(.)(?=\1\1)', 'blaaah')` or `re.search(r'(.)(?=\1{2})', 'blaaah')` or `re.search(r'(.)(?=\1{2,})', 'blaaah')` or `re.search(r'(.)\1{2,}', 'blaaah')`. – Avinash Raj Jan 18 '15 at 04:49
  • I think you mean `(?P=g){2}` = `\1\1`. – l'L'l Jan 18 '15 at 04:54