18

Is it possible to initialise a numpy recarray that will hold strings, without knowing the length of the strings beforehand?

As a (contrived) example:

mydf = np.empty( (numrows,), dtype=[ ('file_name','STRING'), ('file_size_MB',float) ] )

The problem is that I'm constructing my recarray in advance of populating it with information, and I don't necessarily know the maximum length of file_name in advance.

All my attempts result in the string field being truncated:

>>> mydf = np.empty( (2,), dtype=[('file_name',str),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('', 6.9164002347457e-310), ('', 9.9413127e-317)], 
      dtype=[('file_name', 'S'), ('file_size_mb', '<f8')])
>>> mydf['file_name']
array(['f', 'a'], 
      dtype='|S1')

(As an aside, why does mydf['file_name'] show 'f' and 'a' whilst mydf shows '' and ''?)

Similarly, if I initialise with type (say) |S10 for file_name then things get truncated at length 10.

The only similar question I could find is this one, but this calculates the appropriate string length a priori and hence is not quite the same as mine (as I know nothing in advance).

Is there any alternative other than initalising the file_name with (eg) |S9999999999999 (ie some ridiculous upper limit)?

Community
  • 1
  • 1
mathematical.coffee
  • 55,977
  • 11
  • 154
  • 194
  • this is such a good question. length 0 strings in recarrays just had me tearing out my hair for half an hour! – Christoph Jul 20 '12 at 11:49

1 Answers1

27

Instead of using the STRING dtype, one can always use object as dtype. That will allow any object to be assigned to an array element, including Python variable length strings. For example:

>>> import numpy as np
>>> mydf = np.empty( (2,), dtype=[('file_name',object),('file_size_mb',float)] )
>>> mydf['file_name'][0]='foobarasdf.tif'
>>> mydf['file_name'][1]='arghtidlsarbda.jpg'
>>> mydf
array([('foobarasdf.tif', 0.0), ('arghtidlsarbda.jpg', 0.0)], 
      dtype=[('file_name', '|O8'), ('file_size_mb', '<f8')])

It is a against the spirit of the array concept to have variable length elements, but this is as close as one can get. The idea of an array is that elements are stored in memory at well-defined and regularly spaced memory addresses, which prohibits variable length elements. By storing the pointers to a string in an array, one can circumvent this limitation. (This is basically what the above example does.)

  • thanks for that - I'm just shifting from the R language and basically wanted a data frame sort of object, and this works great! – mathematical.coffee Feb 03 '12 at 00:11
  • 4
    Late comment: if you are moving from R, consider the pandas.DataFrame object, which should seem very familiar to you and handles strings well. – mdurant May 29 '15 at 15:14