4

Can anyone tell me the fastest way to convert this string array into a number array, as shown below?

import numpy as np
strarray = np.array([["123456"], ["654321"]])

     to

numberarray = np.array([[1,2,3,4,5,6], [6,5,4,3,2,1]])

Mapping each str to a list and then mapping each character to int is too slow for a large array!

Please help!

zshtom
  • Possible duplicate of [How to convert an array of strings to an array of floats in numpy?](http://stackoverflow.com/questions/3877209/how-to-convert-an-array-of-strings-to-an-array-of-floats-in-numpy) – idjaw Feb 24 '16 at 13:22
  • Is this a typo? ["12456"] -> [1,2,**3**,4,5,6] – Ian Feb 24 '16 at 13:22
  • Are all elements guaranteed to have the same length (like it's 6 in the sample case)? – Divakar Feb 24 '16 at 13:32
  • To Ian: Yes, that was a typo! I have already corrected it. – zshtom Feb 24 '16 at 13:58
  • To Divakar: Yes, guaranteed to have the same length!! – zshtom Feb 24 '16 at 13:59
  • To idjaw: That is different from my question. I want to split an array of strings (with no delimiter) into a number array. The original strings are quite long (320 digits), so I am looking for an efficient way to do this kind of conversion. – zshtom Feb 24 '16 at 14:01

2 Answers

3

You can split the strings into single characters with the array view method:

In [18]: strarray = np.array([[b"123456"], [b"654321"]])

In [19]: strarray.dtype
Out[19]: dtype('S6')

In [20]: strarray.view('S1')
Out[20]: 
array([['1', '2', '3', '4', '5', '6'],
       ['6', '5', '4', '3', '2', '1']], 
      dtype='|S1')

See the NumPy documentation for the data type character codes.

Then the most obvious next step is to use astype:

In [23]: strarray.view('S1').astype(int)
Out[23]: 
array([[1, 2, 3, 4, 5, 6],
       [6, 5, 4, 3, 2, 1]])

However, it's a lot faster to reinterpret (view) the memory underlying the strings as single byte integers and subtract 48. This works because ASCII characters take up a single byte and the characters '0' through '9' are binary equivalent to (u)int8's 48 through 57 (check the ord builtin).
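
A minimal sketch of that reinterpretation, assuming the array holds equal-length ASCII byte strings containing only the characters '0' through '9' (the variable names `codes` and `numberarray` are just illustrative):

```python
import numpy as np

# Byte strings of equal length, as in the example above.
strarray = np.array([[b"123456"], [b"654321"]])

# Reinterpret each 6-byte string as six unsigned 8-bit integers (the ASCII codes).
codes = strarray.view(np.uint8)

# ord('0') == 48, so subtracting it maps the codes 48..57 to the digits 0..9.
numberarray = codes - ord("0")
print(numberarray)
```

The result keeps the uint8 dtype; append `.astype(int)` if wider integers are needed. Any character outside '0' through '9' would silently produce a wrong value rather than raise an error.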

Speed comparison:

In [26]: ar = np.array([[''.join(np.random.choice(list('123456789'), size=320))] for _ in range(1000)], bytes)

In [27]: %timeit _ = ar.view('S1').astype(np.uint8)
1 loops, best of 3: 284 ms per loop

In [28]: %timeit _ = ar.view(np.uint8) - ord('0')
1000 loops, best of 3: 1.07 ms per loop

If you have Unicode strings instead of ASCII, these steps need to be done slightly differently. Alternatively, just convert to ASCII first with astype(bytes).
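
As a rough sketch of both routes for a Unicode array: the first option is the astype(bytes) conversion mentioned above. The second is only my assumption about what "slightly different" could look like, and relies on NumPy storing each character of a 'U' string as a 4-byte code point in native byte order. The names `via_bytes` and `via_uint32` are illustrative.

```python
import numpy as np

# On Python 3, plain str literals produce a Unicode array (dtype '<U6' here).
strarray = np.array([["123456"], ["654321"]])

# Option 1: convert to ASCII bytes first, then view the memory as single bytes.
via_bytes = strarray.astype(bytes).view(np.uint8) - ord("0")

# Option 2 (assumption): view the 4-byte code points directly as 32-bit
# unsigned integers, which skips the intermediate copy made by astype.
via_uint32 = strarray.view(np.uint32) - ord("0")

print(via_bytes)
print(via_uint32)
```

Both results contain the same digits; they differ only in dtype (uint8 versus uint32).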

  • Could be a version issue, I am getting unicode for `strarray.dtype`. I am on Python 3.4. And `ar.view('S1')` has `"b'"` all over along with the strings themselves. – Divakar Feb 24 '16 at 17:06
  • @Divakar - I changed the strings to bytes for Python 3 compatibility. –  Feb 24 '16 at 17:16
  • But if OP has those as strings, he/she has to convert to byte first, right? How could that be done? – Divakar Feb 24 '16 at 17:18
  • @Divakar - Python 2.x has ASCII strings as default and for those it works. –  Feb 24 '16 at 17:22
  • Ah yes you have mentioned `.astype(bytes)` for the conversion in the post! Nice, works for me now. – Divakar Feb 24 '16 at 17:30
  • @morningsun Great solution! Thank you very much! – zshtom Feb 25 '16 at 14:43
0

Here's an approach that converts each input string to a 1D numeric array of length N, where N is the common length of the strings. It first converts each string to its int equivalent and divides that by successive powers of 10. Each digit is then recovered as the difference between one scaled value and 10 times the next scaled value. The implementation looks like this -

A = (strarray.astype(int)/(10**np.arange(len(strarray[0][0])))).astype(int)
out = np.column_stack((A[:,-1],(A[:,:-1] - 10*A[:,1:])[:,::-1]))

Sample run -

In [177]: strarray  = np.array([["0308468"], ["6540542"], ["4973473"]])

In [178]: A = (strarray.astype(int)/(10**np.arange(len(strarray[0][0])))).astype(int)
     ...: out = np.column_stack((A[:,-1],(A[:,:-1] - 10*A[:,1:])[:,::-1]))
     ...: 

In [179]: out
Out[179]: 
array([[0, 3, 0, 8, 4, 6, 8],
       [6, 5, 4, 0, 5, 4, 2],
       [4, 9, 7, 3, 4, 7, 3]])
Divakar