Split a string "aabbcc" -> ["aa", "bb", "cc"] without re.split

Question

I would like to split a string as shown in the title, in a single call. I'm looking for a simple syntax using a list comprehension, but I haven't found one yet:

s = "123456"

And the result would be:

["12", "34", "56"]

What I don't want:

re.split('(?i)([0-9a-f]{2})', s)
s[0:2], s[2:4], s[4:6]
[s[i*2:i*2+2] for i in len(s) / 2]

Edit:

OK, I wanted to parse a hex RGB[A] color (and possibly other color/component formats) to extract all the components. It seems that the fastest approach is the last one from sven-marnach:

sven-marnach xrange: 0.883 usec per loop

python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'

pair/iter: 1.38 usec per loop

python -m timeit -s 's="aabbcc"' '["%c%c" % pair for pair in zip(* 2 * [iter(s)])]'

Regex: 2.55 usec per loop

python -m timeit -s 'import re; s="aabbcc"; c=re.compile("(?i)([0-9a-f]{2})"); split=re.split' '[int(x, 16) / 255. for x in split(c, s) if x != ""]'

Should `"aaabbb"` be split into `["aaa", "bbb"]` or `["aa", "ab", "bb"]`? Why don't you like the example implementations you gave, especially the last one? — Sven Marnach, Feb 08 '12 at 11:49
I would go for the non regex solution (second or third) ... The last one IS list comprehension. — Michel Keijzers, Feb 08 '12 at 11:49
What do you mean by splitting according to the title? What happens if the input string is "aaabbb"? What happens if the input string is "abcdef"? State the problem and the rules of splitting precisely. — Susam Pal, Feb 08 '12 at 11:49
Why not? The `[s[i*2:i*2+2] for i in len(s) / 2]` seems perfect. What's wrong with it? Is this Homework? — S.Lott, Feb 08 '12 at 11:50
@S.Lott: I wouldn't exactly call it perfect, since it isn't even valid Python. `[s[i:i+2] for i in range(0, len(s), 2)]` would be better. — Sven Marnach, Feb 08 '12 at 11:51
I really was thinking it would be possible to write it with list comprehension [::x] or something like that, not using [x for x in ...]. But seem it's not possible :) — tito, Feb 08 '12 at 11:51
It's not restrictions, just wanting to know if it was possible to write it in a more simple form. — tito, Feb 08 '12 at 11:52
@tito: Care to clarify what you are actually trying to do? (See my question above.) — Sven Marnach, Feb 08 '12 at 11:55
I wanted to parse an hex color in a RGB[A] format, and have the fastest execution. — tito, Feb 08 '12 at 11:58
@tito: You should probably ask exactly this in a new question, because the answers will be completely different to the answers in this question. (Hint: use `"aabbcc".decode("hex")` together with `struct.unpack()`.) — Sven Marnach, Feb 08 '12 at 12:01
I think the fastest would be to parse it as a single 32-bit (8-digit) hex number and then split the number into channels through bit shifts and masks, or modulo division. — SF., Feb 08 '12 at 12:04
Surely the fastest will be to maintain a hash table of all possible 6-digit hex strings in a C extension. This sounds like premature optimization to begin with, though. — Wooble, Feb 08 '12 at 12:19
@tito: What if `re.split()` is fastest? Have you used `timeit` yet? — S.Lott, Feb 08 '12 at 13:46

score 4 · Accepted Answer · answered Feb 08 '12 at 13:28

Reading through the comments, it turns out the actual question is: What is the fastest way to parse a color definition string in hexadecimal RRGGBBAA format. Here are some options:

import struct

def rgba1(s, unpack=struct.unpack):
    return unpack("BBBB", s.decode("hex"))

def rgba2(s, int=int, xrange=xrange):
    return [int(s[i:i+2], 16) for i in xrange(0, 8, 2)]

def rgba3(s, int=int, xrange=xrange):
    x = int(s, 16)
    # Shift from the most significant byte down so the order is R, G, B, A
    return [(x >> i) & 255 for i in xrange(24, -8, -8)]

As I expected, the first version turns out to be fastest:

In [6]: timeit rgba1("aabbccdd")
1000000 loops, best of 3: 1.44 us per loop

In [7]: timeit rgba2("aabbccdd")
100000 loops, best of 3: 2.43 us per loop

In [8]: timeit rgba3("aabbccdd")
100000 loops, best of 3: 2.44 us per loop
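
The three functions above are written for Python 2: `str.decode("hex")` and `xrange` no longer exist in Python 3. A rough Python 3 equivalent might look like the sketch below, assuming `bytes.fromhex` as the replacement for the hex decode. The `_py3` names are made up for this illustration, and the timings above were not measured on this code.

```python
import struct


def rgba1_py3(s):
    # bytes.fromhex stands in for Python 2's s.decode("hex")
    return struct.unpack("BBBB", bytes.fromhex(s))


def rgba2_py3(s):
    # Parse each two-character slice as a base-16 integer
    return [int(s[i:i + 2], 16) for i in range(0, 8, 2)]


def rgba3_py3(s):
    # Parse all 32 bits at once, then shift from the most significant
    # byte down so the result is ordered R, G, B, A
    x = int(s, 16)
    return [(x >> shift) & 255 for shift in range(24, -8, -8)]


print(rgba1_py3("aabbccdd"))  # (170, 187, 204, 221)
```

In Python 3, iterating over a `bytes` object already yields integers, so `tuple(bytes.fromhex(s))` may be enough when `struct` is not otherwise needed.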

score 1 · Answer 2 · edited May 23 '17 at 11:48

In [4]: ["".join(pair) for pair in zip(* 2 * [iter(s)])]
Out[4]: ['aa', 'bb', 'cc']

See: How does zip(*[iter(s)]*n) work in Python? for explanations as to that strange "2-iter over the same str" syntax.
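
As a rough illustration of why this works: `2 * [iter(s)]` is a list holding the same iterator object twice, so `zip` pulls two consecutive characters from it for every tuple it builds. The sketch below spells that out step by step (the variable names are only for illustration):

```python
s = "aabbcc"

it = iter(s)            # a single iterator over the characters
same_twice = 2 * [it]   # [it, it]: two references to that one iterator
assert same_twice[0] is same_twice[1]

# zip takes one item from each argument in turn; since both arguments are
# the same iterator, each tuple consumes two consecutive characters.
pairs = list(zip(*same_twice))
chunks = ["".join(pair) for pair in pairs]

print(pairs)   # [('a', 'a'), ('b', 'b'), ('c', 'c')]
print(chunks)  # ['aa', 'bb', 'cc']
```

One consequence of this approach is that a trailing character is silently dropped when the string has odd length, because `zip` stops as soon as the shared iterator runs out.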

You say in the comments that you want to "have the fastest execution". I can't promise you that with this implementation, but you can measure the execution using timeit. Remember what Donald Knuth said about premature optimisation, of course. For the problem at hand (now that you've revealed it), I think you'd find r, g, b = s[0:2], s[2:4], s[4:6] hard to beat.

$ python3.2 -m timeit -c '
s = "aabbcc"
["".join(pair) for pair in zip(* 2 * [iter(s)])]
'
100000 loops, best of 3: 4.49 usec per loop

Cf.

$ python3.2 -m timeit -c '
s = "aabbcc"
r, g, b = s[0:2], s[2:4], s[4:6]
'
1000000 loops, best of 3: 1.2 usec per loop
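
The same comparison can also be run from inside Python with the `timeit` module instead of the command line. This is only a sketch: the helper names `by_zip` and `by_slices` are made up here, and because each statement is wrapped in a function, the figures include call overhead and will not match the numbers above.

```python
import timeit

s = "aabbcc"


def by_zip():
    # Pair up consecutive characters via two references to one iterator
    return ["".join(pair) for pair in zip(*2 * [iter(s)])]


def by_slices():
    # Fixed slices, valid only for a six-character string
    return [s[0:2], s[2:4], s[4:6]]


# Both approaches should agree before their timings are compared.
assert by_zip() == by_slices() == ["aa", "bb", "cc"]

# timeit.timeit returns the total time in seconds for `number` calls.
n = 10000
t_zip = timeit.timeit(by_zip, number=n)
t_slices = timeit.timeit(by_slices, number=n)
print("zip:    %.3f usec per loop" % (t_zip / n * 1e6))
print("slices: %.3f usec per loop" % (t_slices / n * 1e6))
```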

Do you really think this is preferable over `[s[i:i+2] for i in range(0, len(s), 2)]`? — Sven Marnach, Feb 08 '12 at 11:58
@SvenMarnach: "Preferable over"? No, just an alternative (and the first thing that came to my tiny little mind). — johnsyweb, Feb 08 '12 at 12:00

score 0 · Answer 3 · edited Mar 09 '23 at 19:00

Numpy is worse than your preferred solution for a single lookup:

$ python -m timeit -s 'import numpy as np; s="aabbccdd"' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; list(a)'
100000 loops, best of 3: 5.14 usec per loop
$ python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
100000 loops, best of 3: 2.41 usec per loop

But if you do several conversions at once, numpy is much faster:

$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.tolist()'
10000 loops, best of 3: 59.6 usec per loop
$ python -m timeit -s 's="aabbccdd" * 100;' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
1000 loops, best of 3: 240 usec per loop

Numpy is faster for batches larger than 2, on my computer. You can easily group the values by setting a.shape to (number_of_colors, 4), though it makes the tolist method 50% slower.

In fact, most of the time is spent converting the array to a list. Depending on what you wish to do with the results, you may be able to skip this intermediate step and reap some benefits:

$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.shape = (100,4)'
100000 loops, best of 3: 6.76 usec per loop
