I have a NumPy array of bytes objects: some characters, followed by b'', followed by other characters (including bytes that raise Unicode errors when decoded):

bytes = numpy.array([b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95', b'', b'\x80', b'\x04', b'\x08', b'\x06'])

I want to get everything before the first b''.

Currently my code is:

txt = []
for c in bytes:
    if c != b'':
        txt.append(c.decode('utf-8'))
    else:
        break
txt = ''.join(txt)

I suppose there is a more efficient and Pythonic way to do that.

Cœur
roipoussiere

1 Answer

I like your approach: it is explicit, the for loop is understandable by everyone, and it isn't all that slow compared to other approaches.

Two suggestions: change your condition from if c != b'' to if c, since a non-empty bytes object is truthy, and don't name your list bytes, because that masks the built-in! Name it bt or something similar :-)
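
With both suggestions applied, the loop might look like the sketch below. The plain-list sample data is an assumption for illustration (shortened from the question), and the break condition is written as if not c so the early exit comes first:

```python
# Sample data as a plain list (an assumption for this sketch), named bt
# so that the built-in bytes type is not masked.
bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95']

chars = []
for c in bt:
    if not c:  # an empty bytes object is falsy: stop at the first b''
        break
    chars.append(c.decode('utf-8'))
txt = ''.join(chars)
print(txt)
```

Because the loop breaks at the first b'', the undecodable bytes that follow it are never passed to decode.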

Other options include itertools.takewhile which will grab elements from an iterable as long as a predicate holds; your operation would look like:

"".join(s.decode('utf-8') for s in takewhile(bool, bt))

This is slightly slower but more compact; if you're a one-liner lover, this might appeal to you.
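
For reference, a self-contained version of the takewhile one-liner, including the import it needs. The sample list bt is an assumption for illustration, shortened from the question's data:

```python
from itertools import takewhile

# Sample data as a plain list (an assumption for this sketch).
bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95']

# takewhile(bool, bt) yields elements until the first falsy one, here
# the empty b'', so the bytes after it are never decoded.
txt = "".join(s.decode('utf-8') for s in takewhile(bool, bt))
print(txt)
```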

Slightly faster and also compact is using index along with a slice:

"".join(b.decode('utf-8') for b in bt[:bt.index(b'')])

While compact, it is also less readable.
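
One caveat with the index approach is that list.index raises ValueError when there is no b'' at all. A minimal sketch that handles this case is shown below; the helper name text_before_empty is hypothetical, and it assumes bt is a plain list (as the comments note, a NumPy array has no index method). As a variation on the one-liner above, it joins the bytes first and decodes once, which also copes with a multi-byte UTF-8 character split across elements:

```python
def text_before_empty(items):
    """Decode everything before the first b'' (hypothetical helper)."""
    try:
        end = items.index(b'')
    except ValueError:
        end = len(items)  # no b'' found: take the whole list
    # Join the raw bytes first, then decode once.
    return b''.join(items[:end]).decode('utf-8')


bt = [b'f', b'o', b'o', b'', b'b', b'a', b'd', b'\xfe', b'\x95']
print(text_before_empty(bt))
```

If the list contains no b'' and includes undecodable bytes, decode will still raise UnicodeDecodeError, so callers may want to handle that separately.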

In short, I'd go with the for loop, since readability counts as very Pythonic in my eyes.

Dimitris Fasarakis Hilliard
  • Thanks for the advice! Oh, in fact the byte array was a numpy array. I like your second solution, but I benchmarked these 3 solutions (with `ba[:np.where(ba == b'')[0][0]]` instead of `ba[:ba.index(b'')]`) and it appears that the for loop solution is faster, so I chose it. – roipoussiere Aug 30 '16 at 12:22
  • @user2914540 oh I was unaware that it was a numpy array, maybe add the `numpy` tag and specify that `bytes` is a numpy array? There might be more efficient ways to do this in numpy. – Dimitris Fasarakis Hilliard Aug 30 '16 at 12:37
  • done. Sorry, this array comes from an external library (netcdf4py) and I discovered it was a numpy array by trying to do ab.index(). – roipoussiere Aug 30 '16 at 12:46