Efficient way to split a bytes array then convert it to string in Python

Question

I have a numpy bytes array containing characters, followed by b'', followed by other characters (including weird characters which raise Unicode errors when decoding):

bytes = numpy.array([b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95', b'', b'\x80', b'\x04', b'\x08', b'\x06'])

I want to get everything before the first b''.

Currently my code is:

txt = []
for c in bytes:
    if c != b'':
        txt.append(c.decode('utf-8'))
    else:
        break
txt = ''.join(txt)

I suppose there is a more efficient and Pythonic way to do that.

By no means a duplicate but I think you are looking for something like this http://stackoverflow.com/q/432112/2988730 — Mad Physicist, Aug 30 '16 at 13:03

score 4 · Accepted Answer · answered Aug 30 '16 at 11:10


I like your way: it is explicit, the for loop is understandable by all, and it isn't all that slow compared to other approaches.

Some suggestions I'd make: change your condition from if c != b'' to if c, since a non-empty bytes object is truthy, and don't name your list bytes, because that masks the built-in! Name it bt or something similar :-)
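As a minimal sketch of those two suggestions applied to the original loop (the sample data is shortened from the question and held in a plain list here, which is an assumption; bt and chars are just illustrative names):

```python
# Sample data modelled on the question, kept in a plain list for illustration.
bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95', b'']

chars = []
for c in bt:
    if not c:  # an empty bytes object is falsy, so stop at the first b''
        break
    chars.append(c.decode('utf-8'))
txt = ''.join(chars)

print(txt)  # foo
```

The undecodable bytes after the first b'' are never reached, so no UnicodeDecodeError is raised.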

Other options include itertools.takewhile, which grabs elements from an iterable as long as a predicate holds; your operation would look like:

from itertools import takewhile
"".join(s.decode('utf-8') for s in takewhile(bool, bt))

This is slightly slower but more compact; if you're a one-liner lover, this might appeal to you.
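For reference, here is a self-contained sketch of the takewhile approach wrapped in a function (prefix_text is a hypothetical name, and the plain-list sample data is an assumption). It also shows that takewhile simply consumes the whole input when there is no b'' at all:

```python
from itertools import takewhile


def prefix_text(items):
    """Decode the items that come before the first empty bytes object."""
    # takewhile(bool, ...) yields items until one is falsy, i.e. until b''.
    return ''.join(s.decode('utf-8') for s in takewhile(bool, items))


bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95', b'']
print(prefix_text(bt))            # foo
print(prefix_text([b'a', b'b']))  # ab (no terminator: everything is taken)
```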

Slightly faster and also compact is using index along with a slice:

"".join(b.decode('utf-8') for b in bt[:bt.index(b'')])

While compact, it also suffers in readability.
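One thing to keep in mind with the index version is that list.index raises ValueError when there is no b'' in the list. A hedged sketch that handles that case follows; text_before_terminator is a hypothetical helper name, and treating a missing terminator as "take everything" is an assumption about the desired behaviour. It also joins the raw bytes before decoding, which is a variation on the one-liner above that keeps multi-byte UTF-8 sequences intact:

```python
def text_before_terminator(items):
    """Return the decoded text before the first b'' in a list of bytes."""
    try:
        end = items.index(b'')  # list.index raises ValueError if b'' is absent
    except ValueError:
        end = len(items)        # assumption: no terminator means take everything
    # Join first, decode once: a character split across several
    # single-byte items still decodes correctly this way.
    return b''.join(items[:end]).decode('utf-8')


bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95', b'']
print(text_before_terminator(bt))  # foo
```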

In short, I'd go with the for loop since readability counts as very pythonic in my eyes.

answered Aug 30 '16 at 11:10

Dimitris Fasarakis Hilliard


Thanks for the advice! Oh, in fact the byte array was a numpy array. I like your second solution, but I benchmarked these 3 solutions (with `ba[:np.where(ba == b'')[0][0]]` instead of `ba[:ba.index(b'')]`) and it appears that the for loop solution is faster, so I chose it. – roipoussiere Aug 30 '16 at 12:22
@user2914540 oh I was unaware that it was a numpy array, maybe add the `numpy` tag and specify that `bytes` is a numpy array? There might be more efficient ways to do this in numpy. – Dimitris Fasarakis Hilliard Aug 30 '16 at 12:37
done. Sorry, this array comes from an external library (netcdf4py) and I discovered it was a numpy array by trying to do ab.index(). – roipoussiere Aug 30 '16 at 12:46
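Since the array turned out to be a numpy array, the slicing idea from the comments could be sketched roughly as below. This is only an illustration, not a benchmarked recommendation: the sample data is shortened from the question, the name ba follows the comment above, and falling back to the full length when no b'' is present is an assumption.

```python
import numpy as np

ba = np.array([b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95', b''])

# Positions of all empty elements; the first one marks the end of the text.
empty_positions = np.flatnonzero(ba == b'')
end = empty_positions[0] if empty_positions.size else len(ba)

# tolist() converts the slice to plain Python bytes objects before joining.
txt = b''.join(ba[:end].tolist()).decode('utf-8')
print(txt)  # foo
```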
