
Is it possible to somehow load an array with a text field of unknown field length?

I figured out how to pass a dtype to get strings into it. However, without specifying a length I just get a U0 type, which does not seem to be able to hold any data. E.g.:

>>> import io, numpy
>>> data = io.StringIO("test data lololol\ntest2 d4t4 ololol")
>>> ar = numpy.loadtxt(data, dtype=[("1",str), ("2",'S'), ("3",'S')])
>>> ar
array([('', b'', b''), ('', b'', b'')], 
      dtype=[('1', '<U0'), ('2', '|S0'), ('3', '|S0')])

When I switch to a dtype with a specified size, I do get the input:

>>> data.seek(0)
0
>>> numpy.loadtxt(data, dtype=[("1",(str,30)), ("2",(str,30)), ("3",('S',30))])
array([("b'test'", "b'data'", b'lololol'),
       ("b'test2'", "b'd4t4'", b'ololol')], 
      dtype=[('1', '<U30'), ('2', '<U30'), ('3', '|S30')])

I'd probably be fine with either S or U. In my case the field is supposed to hold a set of textual flags, something like Linux environment variables. Thus, preallocating a large space just in case seems like a big waste, especially when the number of rows goes into the millions.

I do understand, or have ideas about, where such a design can come from, e.g. constructing a struct-like object that holds the whole row in a contiguous memory block. However, I thought there might be a way to make it keep just a pointer in the case of strings.

Is it possible?

luk32
  • Not the task numpy is suited for. You can, for example, encode your strings (i.e. manually construct pointers) with, say, a `hash` function, and store them elsewhere. – alko Dec 17 '13 at 22:51

1 Answer


The question "getting indices in numpy" uses np.recfromtxt, which can generate the dtype automatically. Effectively it calls np.genfromtxt with dtype=None.

Data like:

david weight_2005 50
david weight_2012 60
david height_2005 150
david height_2012 160

produces:

array([('david', 'weight_2005', 50), ('david', 'weight_2012', 60),
       ('david', 'height_2005', 150), ('david', 'height_2012', 160),...], 
      dtype=[('f0', 'S5'), ('f1', 'S11'), ('f2', '<i4')])
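
For reference, a minimal sketch of the call that produces an array like the one above; the file name here is made up, and the exact string dtype (S vs <U) depends on the Python/NumPy version:

import numpy as np

arr = np.genfromtxt('data.txt', dtype=None)   # per-field dtype is inferred from the data
arr.dtype
# dtype([('f0', 'S5'), ('f1', 'S11'), ('f2', '<i4')])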

The code in genfromtxt for determining the dtype looks complex. My guess is that it adjusts the Snn to accommodate the longest string it encounters in that field.

One way to customize the dtype is to assign names in genfromtxt, and recast the values afterwards with astype:

import numpy as np
x = np.genfromtxt('stack19944408.txt', dtype=None, names=['one','two','thr'])
x.astype(dtype=[('one','S10'), ('two','S10'), ('thr','f')])
#array([('david', 'weight_200', 50.0), ('david', 'weight_201', 60.0),
#       ...
#      dtype=[('one', 'S10'), ('two', 'S10'), ('thr', '<f4')])
hpaulj
  • I see there is no simple solution. Your solution kind of works, but it's as I feared: it needs to preallocate space regardless of the actual data. My problem is that the field can be from around `10` to `100` characters long, and this goes for 10^6 to 10^8 rows. That's why I don't like it. I accepted your answer because for smaller data sets it's probably fine and works automatically, which I like. I personally coded the flags into binary and provided `str <=> uint` dictionary mappings to decode them later (a sketch of that idea follows below). This way I had some extra work, but saved a lot of space. – luk32 Dec 20 '13 at 11:42
  • So you are worried that if `genfromtxt` chooses `S100` to accommodate your longest record field, there will be a lot of blanks in the other records? I think that has to be the case if the strings are stored in the array itself (with a constant record size). The alternative is `object` `dtype`, with the strings stored as regular Python strings (and just pointers in the array); see the second sketch below. – hpaulj Dec 20 '13 at 23:02
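
A rough sketch of the flag-encoding idea from luk32's comment, assuming a bitmask representation; the flag names and the mapping are made up for illustration. Each set of textual flags is packed into one small unsigned integer, so the array stores only integers and the strings live in a separate mapping:

import numpy as np

# hypothetical flag <=> bit mapping
flag_bits = {'FOO': 1 << 0, 'BAR': 1 << 1, 'BAZ': 1 << 2}

def encode(flags):
    # pack a set of textual flags into one unsigned integer
    value = 0
    for f in flags:
        value |= flag_bits[f]
    return value

def decode(value):
    # recover the set of textual flags from the packed integer
    return {name for name, bit in flag_bits.items() if value & bit}

rows = [{'FOO', 'BAZ'}, {'BAR'}, set()]
encoded = np.array([encode(r) for r in rows], dtype=np.uint32)
# decode(encoded[0]) -> {'FOO', 'BAZ'}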
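
And a minimal sketch of the object-dtype alternative from the last comment, using the sample data from the question: the structured array then stores only pointers, and each field holds an ordinary Python string of whatever length it needs.

import io
import numpy as np

data = io.StringIO("test data lololol\ntest2 d4t4 ololol")
rows = [tuple(line.split()) for line in data]
ar = np.array(rows, dtype=[('1', object), ('2', object), ('3', object)])
# ar.dtype -> dtype([('1', 'O'), ('2', 'O'), ('3', 'O')])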