4

What is the most efficient way in Python to double (or repeat n times) every letter in a string?

"abcd" -> "aabbccdd"

or

"abcd" -> "aaaabbbbccccdddd"

I have a long string that needs to be transformed in this fashion, and the current solution involves a loop with n concatenations for every letter, which I imagine could be done more efficiently.
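
For reference, the current implementation is essentially this (a simplified sketch, not my exact code):

def repeat_letters(s, n):
    out = ""
    for ch in s:
        for _ in range(n):
            out += ch  # one concatenation per repeated character
    return out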

Tim
  • 4,790
  • 4
  • 33
  • 41
  • Do you _really_ need the most efficient? If one way takes 309ns and another takes 316ns, are you going to pick the first even if it's less readable, less flexible, etc.? – abarnert Jun 04 '13 at 00:08
  • No, I don't _really_ need it to be the most efficient, but the previous implementation was super verbose and I knew there must be a more "Python" way of doing it that would also be faster. – Tim Jun 04 '13 at 00:17
  • That's my point. Usually, being concise/readable/flexible/pythonic is far more important than being efficient. And from your comment, it sounds like that's true in your case. Don't ask for "most efficient" if that's not what you want. – abarnert Jun 04 '13 at 00:20

8 Answers

11

Use str.join:

>>> strs = "abcd"
>>> "".join([x*2 for x in strs])
'aabbccdd'
>>> "".join([x*4 for x in strs])
'aaaabbbbccccdddd'

From the docs:

s = ""
for substring in list:
    s += substring

Use s = "".join(list) instead. The former is a very common and catastrophic mistake when building large strings.

Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • Why create an intermediate list? – arshajii Jun 03 '13 at 23:48
  • 4
    @arshajii http://stackoverflow.com/questions/9060653/list-comprehension-without-python/9061024#9061024 – Ashwini Chaudhary Jun 03 '13 at 23:49
  • That's news to me as well, as [PEP 289](http://www.python.org/dev/peps/pep-0289/) still indicates that generator expressions conserve memory better than list comprehensions. – Justin S Barrett Jun 03 '13 at 23:56
  • 1
    @JustinSBarrett but a join can make better use of a list, than it can a generator... (known vs unknown) – Jon Clements Jun 03 '13 at 23:57
  • 1
    @JustinSBarrett: If you read the answer Ashwini linked, Raymond Hettinger explains exactly that: *In general*, genexps conserve memory; in the particular case of `join` (or any other function that needs to make multiple passes), they can't. – abarnert Jun 04 '13 at 00:07
  • I consider this solution the most readable, but you can improve it a little with "".join(x*4 for x in strs) to avoid the eager evaluation. – dstromberg Jun 04 '13 at 02:01
8

Since you specifically asked about efficiency:

from itertools import chain
import re

# drewk's answer, optimized by using from_iterable instead of *
def double_chain(s):
    return ''.join(chain.from_iterable(zip(s, s)))

# Ashwini Chaudhary's answer
def double_mult(s):
    return ''.join([x*2 for x in s])

# Jon Clements' answer, but optimized to take the re.compile and *2 out of the loop.
r = re.compile('(.)')
def double_re(s):
    return r.sub(r'\1\1', s)

Now:

In [499]: %timeit double_chain('abcd')
1000000 loops, best of 3: 1.99 us per loop
In [500]: %timeit double_mult('abcd')
1000000 loops, best of 3: 1.25 us per loop
In [501]: %timeit double_re('abcd')
10000 loops, best of 3: 22.2 us per loop

So, the itertools method is about 60% slower than the simplest method, and using regexes is more than an order of magnitude slower still.

But a tiny string like this may not be representative of how longer strings behave, so:

In [504]: %timeit double_chain('abcd' * 10000)
100 loops, best of 3: 4.92 ms per loop
In [505]: %timeit double_mult('abcd' * 10000)
100 loops, best of 3: 5.57 ms per loop
In [506]: %timeit double_re('abcd' * 10000)
10 loops, best of 3: 91.5 ms per loop

As expected, the itertools method gets better (and now beats the simple way), and the regexp gets even worse as the string gets longer.

So, there is no one "most efficient" way. If you're doubling billions of tiny strings, Ashwini's answer is the best. If you're doubling millions of big strings, or thousands of huge strings, drewk's is best. And if you're doing neither… there's no reason to be optimizing this in the first place.

Also, usual caveats: This test is 64-bit CPython 3.3.0 on my Mac with no load; no guarantees the same will be true for your Python implementation, version, and platform, in your app, with your real data. A quick test with 32-bit 2.6 showed similar results, but if it matters, you need to run a more realistic and relevant test yourself.
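
For completeness, the chain version generalizes to n repeats by zipping the string with itself n times (a sketch; repeat_chain is my name for it):

from itertools import chain

def repeat_chain(s, n):
    # zip(*[s] * n) yields a tuple of n copies of each character
    return ''.join(chain.from_iterable(zip(*[s] * n)))

>>> repeat_chain('abcd', 4)
'aaaabbbbccccdddd'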

abarnert
  • 354,177
  • 51
  • 601
  • 671
2

You can use join, izip and chain:

>>> st='abcd'
>>> from itertools import chain,izip
>>> ''.join(chain(*izip(st,st)))
'aabbccdd'

While it is less readable than a list comprehension, the advantage is no intermediate lists; izip and chain produce iterators.
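
On Python 3, where izip is gone, the built-in zip is already lazy, so the equivalent (a sketch) is:

>>> from itertools import chain
>>> ''.join(chain(*zip(st, st)))
'aabbccdd'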

dawg
  • 98,345
  • 23
  • 131
  • 206
  • There is still one intermediate list, because if you pass `join` anything that's not a sequence, it will make a list out of it. See the link from Ashwini Chaudhary's comment to his own answer. – abarnert Jun 04 '13 at 00:21
  • 1
    Also, `chain.from_iterable(it)` is usually considered more readable than `chain(*it)` (it wouldn't have been added to `itertools` otherwise), and it's also slightly faster. – abarnert Jun 04 '13 at 00:22
  • But +1, because it's a nice way to think about the problem (especially if you extend it to handling n>2), and it also happens to be the fastest for non-trivial strings. – abarnert Jun 04 '13 at 00:23
  • @abarnert what can we do about this `chain(*it)` epidemic? – jamylak Jun 04 '13 at 02:19
  • @jamylak: If only SO ran every answer through a compile and lint step. :) – abarnert Jun 04 '13 at 02:28
1

I would have gone for str.join, so I'll offer up the re option as an alternative:

>>> s = "abcd"
>>> import re
>>> re.sub('(.)', r'\1' * 2, s)
'aabbccdd'
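
For the repeat-n-times variant, the replacement string can be scaled the same way (a sketch):

>>> n = 3
>>> re.sub('(.)', r'\1' * n, s)
'aaabbbcccddd'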
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
1

Whenever the question is "What's the most efficient way to map every character of a string to something else?", it turns out that str.translate is the best option... for big enough strings:

def double_translate(s):
    # map each distinct character's code point to the doubled character,
    # then let str.translate perform the substitution
    return s.translate({ord(x): 2*x for x in set(s)})

Timings against other answers:

In [5]: %timeit double_chain('abcd')
The slowest run took 11.03 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 992 ns per loop

In [6]: %timeit double_chain('mult')
The slowest run took 13.61 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1 µs per loop

In [7]: %timeit double_mult('abcd')
The slowest run took 7.59 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 869 ns per loop

In [8]: %timeit double_re('abcd')
The slowest run took 8.63 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 9.4 µs per loop

In [9]: %timeit double_translate('abcd')
The slowest run took 5.80 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 1.78 µs per loop

In [10]: %%timeit t='abcd'*5000
    ...: double_chain(t)
    ...: 
1000 loops, best of 3: 1.66 ms per loop

In [11]: %%timeit t='abcd'*5000
    ...: double_mult(t)
    ...: 
100 loops, best of 3: 2.35 ms per loop

In [12]: %%timeit t='abcd'*5000
    ...: double_re(t)
    ...: 
10 loops, best of 3: 30 ms per loop

In [13]: %%timeit t='abcd'*5000
    ...: double_translate(t)
    ...: 
1000 loops, best of 3: 1.03 ms per loop

Note, however, that this solution has the added advantage that in some cases you can avoid rebuilding the table passed to translate, for example:

def double_translate_opt(s, table=None):
    if table is None:
        table = {ord(x):2*x for x in set(s)}
    return s.translate(table)

This avoids some overhead, making it even faster:

In [19]: %%timeit t='abcd'; table={ord(x):2*x for x in t}
    ...: double_translate_opt(t, table)
    ...: 
The slowest run took 17.59 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 452 ns per loop

As you can see, with small strings it's twice as fast as the current answer, provided that you avoid constructing the table every time. For long texts, the cost of building the table is repaid by the speed of translate (and using set is worth it in those cases to avoid many calls to ord).
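
For instance, if all inputs are known to be lowercase ASCII, the table can be built once up front and reused across calls (a sketch using double_translate_opt from above; string.ascii_lowercase is from the standard library):

import string

# build the translation table once for the whole alphabet
table = {ord(c): 2 * c for c in string.ascii_lowercase}

for word in ('abcd', 'wxyz'):
    print(double_translate_opt(word, table))
# aabbccdd
# wwxxyyzz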

Bakuriu
  • 98,325
  • 22
  • 197
  • 231
0
def double_letter(s, n=2):
    # naive approach: repeated concatenation,
    # the pattern the question is trying to replace
    result = ''
    for ch in s:
        result += ch * n
    return result
Faizan
  • 25
  • 1
  • 7
0
def crazy(words):
    return "".join([letter * 2 for letter in words])
Arpan Saini
  • 4,623
  • 1
  • 42
  • 50
JUJU
  • 1
  • 1
0
def crazy(words, n):
    return "".join([letter * n for letter in words])

# function call
print(crazy("arpan", 3))

Output:

aaarrrpppaaannn

Arpan Saini
  • 4,623
  • 1
  • 42
  • 50