
I would like to split a string into two-character chunks, as described in the title, in a single call. I'm looking for a simple syntax using a list comprehension, but I haven't found one yet:

s = "123456"

And the result would be:

["12", "34", "56"]

What I don't want:

re.split('(?i)([0-9a-f]{2})', s)
s[0:2], s[2:4], s[4:6]
[s[i*2:i*2+2] for i in len(s) / 2]

Edit:

OK, I wanted to parse a hex RGB[A] color (and possibly other color/component formats) to extract all the components. It seems that the fastest approach is the last one from sven-marnach:

  1. sven-marnach xrange: 0.883 usec per loop

    python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
    
  2. pair/iter: 1.38 usec per loop

    python -m timeit -s 's="aabbcc"' '["%c%c" % pair for pair in zip(* 2 * [iter(s)])]'
    
  3. Regex: 2.55 usec per loop

    python -m timeit -s 'import re; s="aabbcc"; c=re.compile("(?i)([0-9a-f]{2})"); split=re.split' '[int(x, 16) / 255. for x in split(c, s) if x != ""]'
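The timings above were taken on Python 2, where `xrange` exists. As a rough sketch only, the fastest variant could be written for Python 3 as below. The function name `parse_hex_color` is made up for illustration, and the sketch assumes the input has an even number of hex digits (RRGGBB or RRGGBBAA).

```python
def parse_hex_color(s):
    """Split a hex color string into two-digit components scaled to 0.0-1.0.

    Hypothetical helper: assumes len(s) is even and s holds only hex digits.
    """
    return [int(s[i:i + 2], 16) / 255.0 for i in range(0, len(s), 2)]


print(parse_hex_color("aabbcc"))
print(parse_hex_color("aabbccdd"))
```

Because the loop runs over `len(s)`, the same function handles both three- and four-component strings. It has not been benchmarked against the numbers quoted above.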
    
Rik Poggi
tito
  • Should `"aaabbb"` be split into `["aaa", "bbb"]` or `["aa", "ab", "bb"]`? Why don't you like the example implementations you gave, especially the last one? – Sven Marnach Feb 08 '12 at 11:49
  • I would go for the non regex solution (second or third) ... The last one IS list comprehension. – Michel Keijzers Feb 08 '12 at 11:49
  • What do you mean by splitting according to the title? What happens if the input string is "aaabbb"? What happens if the input string is "abcdef"? State the problem and the rules of splitting precisely. – Susam Pal Feb 08 '12 at 11:49
  • Why not? The `[s[i*2:i*2+2] for i in len(s) / 2]` seems perfect. What's wrong with it? Is this Homework? – S.Lott Feb 08 '12 at 11:50
  • @S.Lott: I wouldn't exactly call it perfect, since it isn't even valid Python. `[s[i:i+2] for i in range(0, len(s), 2)]` would be better. – Sven Marnach Feb 08 '12 at 11:51
  • I really was thinking it would be possible to write it with something like [::x], not using [x for x in ...]. But it seems it's not possible :) – tito Feb 08 '12 at 11:51
  • It's not a restriction; I just wanted to know whether it could be written in a simpler form. – tito Feb 08 '12 at 11:52
  • @tito: Care to clarify what you are actually trying to do? (See my question above.) – Sven Marnach Feb 08 '12 at 11:55
  • I wanted to parse a hex color in RGB[A] format, with the fastest possible execution. – tito Feb 08 '12 at 11:58
  • @tito: You should probably ask exactly this in a new question, because the answers will be completely different to the answers in this question. (Hint: use `"aabbcc".decode("hex")` together with `struct.unpack()`.) – Sven Marnach Feb 08 '12 at 12:01
  • I think the fastest would be to parse it as a single 32-bit (8-digit) hex number and then split the number into channels through bit shifts and masks, or modulo division. – SF. Feb 08 '12 at 12:04
  • Surely the fastest will be to maintain a hash table of all possible 6-digit hex strings in a C extension. This sounds like premature optimization to begin with, though. – Wooble Feb 08 '12 at 12:19
  • @tito: What if `re.split()` is fastest? Have you used `timeit` yet? – S.Lott Feb 08 '12 at 13:46
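
The valid form Sven Marnach gives in the comments generalizes to any chunk width. A minimal sketch for Python 3 follows; the name `chunks` and the parameter `n` are illustrative and do not come from the thread.

```python
def chunks(s, n=2):
    """Return consecutive n-character slices of s; the last one may be shorter."""
    return [s[i:i + n] for i in range(0, len(s), n)]


print(chunks("123456"))     # ['12', '34', '56']
print(chunks("aaabbb", 3))  # ['aaa', 'bbb']
```

Stepping the range by `n` avoids the `i*2` arithmetic of the question's third attempt, and it also sidesteps iterating over an integer, which is what made that attempt invalid.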

3 Answers

Reading through the comments, it turns out the actual question is: what is the fastest way to parse a color definition string in hexadecimal RRGGBBAA format? Here are some options:

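import struct

# Python 2 only: str.decode("hex") and xrange do not exist in Python 3.

def rgba1(s, unpack=struct.unpack):
    # Decode the hex digits to 4 raw bytes, then unpack them as unsigned chars.
    return unpack("BBBB", s.decode("hex"))

def rgba2(s, int=int, xrange=xrange):
    # Parse each two-digit slice separately.
    return [int(s[i:i+2], 16) for i in xrange(0, 8, 2)]

def rgba3(s, int=int, xrange=xrange):
    # Parse the whole string once, then extract bytes with shifts and masks.
    # Note: this yields the low byte first, i.e. the channels in A, B, G, R order.
    x = int(s, 16)
    return [(x >> i) & 255 for i in xrange(0, 32, 8)]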

As I expected, the first version turns out to be fastest:

In [6]: timeit rgba1("aabbccdd")
1000000 loops, best of 3: 1.44 us per loop

In [7]: timeit rgba2("aabbccdd")
100000 loops, best of 3: 2.43 us per loop

In [8]: timeit rgba3("aabbccdd")
100000 loops, best of 3: 2.44 us per loop
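
The three functions above rely on Python 2 features (`str.decode("hex")` and `xrange`). As a sketch that is not part of the original timings, they might be ported to Python 3 as follows. The function names are made up here, and nothing is claimed about their relative speed. The shift-based version walks from the high byte down so that it returns the channels in R, G, B, A order.

```python
import struct


def rgba_unpack(s):
    # bytes.fromhex() takes the place of Python 2's s.decode("hex").
    return struct.unpack("BBBB", bytes.fromhex(s))


def rgba_slices(s):
    # Parse each two-digit slice on its own.
    return [int(s[i:i + 2], 16) for i in range(0, 8, 2)]


def rgba_shifts(s):
    x = int(s, 16)
    # Shifts of 24, 16, 8, 0 pick out the bytes from most to least significant.
    return [(x >> shift) & 255 for shift in range(24, -8, -8)]


print(rgba_unpack("aabbccdd"))  # (170, 187, 204, 221)
print(rgba_slices("aabbccdd"))  # [170, 187, 204, 221]
print(rgba_shifts("aabbccdd"))  # [170, 187, 204, 221]
```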
Sven Marnach
In [4]: ["".join(pair) for pair in zip(* 2 * [iter(s)])]
Out[4]: ['aa', 'bb', 'cc']

See "How does zip(*[iter(s)]*n) work in Python?" for an explanation of that strange "two iterators over the same str" syntax.
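
In short, the idiom works because both arguments to `zip` are the same iterator object, so each tuple consumes two consecutive characters. A small sketch that spells this out (the variable names are illustrative):

```python
s = "aabbcc"

it = iter(s)         # one iterator over the characters of s
pairs = zip(it, it)  # equivalent to zip(*[iter(s)] * 2): the same object twice

result = ["".join(pair) for pair in pairs]
print(result)  # ['aa', 'bb', 'cc']

# With an odd-length string, zip stops early and the last character is dropped.
odd = ["".join(pair) for pair in zip(*[iter("abcde")] * 2)]
print(odd)  # ['ab', 'cd']
```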


You say in the comments that you want to "have the fastest execution". I can't promise you that with this implementation, but you can measure the execution time using timeit. Remember what Donald Knuth said about premature optimisation, of course. For the problem at hand (now that you've revealed it), I think you'd find r, g, b = s[0:2], s[2:4], s[4:6] hard to beat.

$ python3.2 -m timeit -c '
s = "aabbcc"
["".join(pair) for pair in zip(* 2 * [iter(s)])]
'
100000 loops, best of 3: 4.49 usec per loop

Cf.

python3.2 -m timeit -c '
s = "aabbcc"
r, g, b = s[0:2], s[2:4], s[4:6]
'
1000000 loops, best of 3: 1.2 usec per loop
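
The same comparison can also be run from inside Python with the `timeit` module rather than from the shell. This is only a sketch: the dictionary keys and the loop count are arbitrary choices, and the numbers it prints will differ from machine to machine.

```python
import timeit

setup = 's = "aabbcc"'
candidates = {
    "zip/iter": '["".join(pair) for pair in zip(*2 * [iter(s)])]',
    "slices": "r, g, b = s[0:2], s[2:4], s[4:6]",
}

number = 100000
timings = {}
for name, stmt in candidates.items():
    # repeat() returns one total time per run; the minimum is the least noisy.
    best = min(timeit.repeat(stmt, setup=setup, number=number, repeat=3))
    timings[name] = best / number * 1e6  # microseconds per loop
    print("%-10s %.3f usec per loop" % (name, timings[name]))
```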
Community
johnsyweb

Numpy is worse than your preferred solution for a single lookup:

$ python -m timeit -s 'import numpy as np; s="aabbccdd"' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; list(a)'
100000 loops, best of 3: 5.14 usec per loop
$ python -m timeit -s 's="aabbcc";' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
100000 loops, best of 3: 2.41 usec per loop

But if you do several conversions at once, numpy is much faster:

$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.tolist()'
10000 loops, best of 3: 59.6 usec per loop
$ python -m timeit -s 's="aabbccdd" * 100;' '[int(s[i:i+2], 16) / 255. for i in xrange(0, len(s), 2)]'
1000 loops, best of 3: 240 usec per loop

On my computer, numpy is faster for batches larger than 2. You can easily group the values by setting a.shape to (number_of_colors, 4), though it makes the tolist method 50% slower.

In fact, most of the time is spent converting the array to a list. Depending on what you wish to do with the results, you may be able to skip this intermediate step and reap some benefits:

$ python -m timeit -s 'import numpy as np; s="aabbccdd" * 100' 'a = np.fromstring(s.decode("hex"), dtype="uint32"); a.dtype = "uint8"; a.shape = (100,4)'
100000 loops, best of 3: 6.76 usec per loop
Glorfindel
Lauritz V. Thaulow