I have a complicated data set on which I need to run distance calculations. Each record contains several different data types, so a record array or structured array appears to be the way to go. The problem is that the scipy spatial distance functions take ordinary arrays, whereas the elements of a record array are numpy voids. How do I make a record array of numpy arrays instead of numpy voids? Below is a very simple example of what I'm talking about.

import numpy
import scipy.spatial.distance as scidist


input_data = [
    ('340.9', '7548.2', '1192.4', 'set001.txt'),
    ('546.7', '9039.9', '5546.1', 'set002.txt'),
    ('456.3', '2234.8', '2198.8', 'set003.txt'),
    ('332.1', '1144.2', '2344.5', 'set004.txt'),
]

record_array = numpy.array(input_data,
                           dtype=[('d1', 'float64'), ('d2', 'float64'), ('d3', 'float64'), ('file', '|S20')])

The following code fails...

this_fails_and_makes_me_cry = record_array[['d1', 'd2', 'd3']]
scidist.pdist(this_fails_and_makes_me_cry)

I get this error....

Traceback (most recent call last):
  File "/home/someguy/working_datasets/trial003/scrap.py", line 16, in <module>
    scidist.pdist(record_array[['d1', 'd2', 'd3']])
  File "/usr/lib/python2.7/dist-packages/scipy/spatial/distance.py", line 1093, in pdist
    raise ValueError('A 2-dimensional array must be passed.');
ValueError: A 2-dimensional array must be passed.

The error occurs because this_fails_and_makes_me_cry is an array of numpy.voids. To get it to work I have to convert each time like this...

this_works = numpy.array(map(list, record_array[['d1', 'd2', 'd3']]))
scidist.pdist(this_works)

Is it possible to create a record array of numpy arrays to begin with? Or is a numpy record/structured array restricted to numpy voids? It would be handy for the record array to contain the data in a format compatible with scipy's spatial distance functions so that I don't have to convert each time. Is this possible?

b10hazard
  • My understanding is that NumPy structured arrays can only contain fields of discrete types (plus fixed-length strings), so no, you cannot store an array. You could turn that conversion into a function to make it easier... and use some standard way to convert the data to a 2D array (like `array.view`), [see here](http://stackoverflow.com/questions/5957380/convert-structured-array-to-regular-numpy-array) – Ricardo Cárdenes Aug 13 '14 at 13:03
  • Bummer. I was hoping that wasn't the case because I have to do this a TON of times due to the large number of distance calculations and the large data set that I have. Thanks for the link. – b10hazard Aug 13 '14 at 13:45

1 Answer

this_fails_and_makes_me_cry = record_array[['d1', 'd2', 'd3']]

creates a one-dimensional structured array, with fields d1, d2 and d3. pdist expects a two-dimensional array. Here's one way to create that two-dimensional array containing only the d fields of record_array.

(Note: The following won't work if the fields that you want to use for the distance calculation are not contiguous within the data type of the structured array record_array. See below for an alternative in that case.)

First, we make a new dtype, in which d1, d2 and d3 become a single field called d containing three floating point values:

In [61]: dt2 = np.dtype([('d', 'f8', 3), ('file', 'S20')])

Next, use the view method to create a view of record_array using this dtype:

In [62]: rav = record_array.view(dt2)

In [63]: rav
Out[63]: 
array([([340.9, 7548.2, 1192.4], 'set001.txt'),
       ([546.7, 9039.9, 5546.1], 'set002.txt'),
       ([456.3, 2234.8, 2198.8], 'set003.txt'),
       ([332.1, 1144.2, 2344.5], 'set004.txt')], 
      dtype=[('d', '<f8', (3,)), ('file', 'S20')])

rav is not a copy--it is a view of the same block of memory used by record_array.

Now access field d to get the two-dimensional array:

In [64]: d = rav['d']

In [65]: d
Out[65]: 
array([[  340.9,  7548.2,  1192.4],
       [  546.7,  9039.9,  5546.1],
       [  456.3,  2234.8,  2198.8],
       [  332.1,  1144.2,  2344.5]])

d can be passed to pdist:

In [66]: pdist(d)
Out[66]: 
array([ 4606.75875427,  5409.10137454,  6506.81395539,  7584.32432455,
        8522.8149229 ,  1107.27706108])

Note that instead of converting record_array to rav, you could use dt2 as the data type of record_array from the start, and just write d = record_array['d'].
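The steps above can be collected into one script. This is only a sketch: it assumes `numpy` is imported as `np`, uses float and bytes literals instead of the strings in the question, and the name `coords` is made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist

# Same layout as the question: three float fields followed by a file name.
record_array = np.array(
    [
        (340.9, 7548.2, 1192.4, b"set001.txt"),
        (546.7, 9039.9, 5546.1, b"set002.txt"),
        (456.3, 2234.8, 2198.8, b"set003.txt"),
        (332.1, 1144.2, 2344.5, b"set004.txt"),
    ],
    dtype=[("d1", "f8"), ("d2", "f8"), ("d3", "f8"), ("file", "S20")],
)

# Regroup the three adjacent float fields into a single subarray field "d".
# Both dtypes have the same item size, so this is a reinterpretation, not a copy.
dt2 = np.dtype([("d", "f8", (3,)), ("file", "S20")])
coords = record_array.view(dt2)["d"]  # two-dimensional, shape (4, 3)

# pdist returns the condensed vector of pairwise Euclidean distances.
distances = pdist(coords)
```

Because `coords` shares memory with `record_array`, the view only has to be taken once; later distance calculations can reuse it without converting again.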


If the fields in record_array that are used for the distance calculation are not contiguous in the structure, you'll first have to pull them out into a new array so they are contiguous:

In [83]: arr = record_array[['d1','d2','d3']]

Then take a view of arr and reshape to make it two-dimensional:

In [84]: d = arr.view(np.float64).reshape(-1,3)

In [85]: d
Out[85]: 
array([[  340.9,  7548.2,  1192.4],
       [  546.7,  9039.9,  5546.1],
       [  456.3,  2234.8,  2198.8],
       [  332.1,  1144.2,  2344.5]])

You can combine those into a single line, if that's more convenient:

In [86]: d = record_array[['d1', 'd2', 'd3']].view(np.float64).reshape(-1, 3)
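On NumPy 1.16 and later, indexing with a list of field names returns a view that keeps the original record layout rather than a packed copy, so the `.view(np.float64)` step may not behave as shown in this session. A possible alternative, assuming that version or newer, is `structured_to_unstructured` from `numpy.lib.recfunctions`. The array `records` and its field order below are hypothetical, chosen so that the float fields are not adjacent.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# Hypothetical layout in which the float fields are NOT adjacent:
# the file name sits between d1 and d2.
records = np.array(
    [
        (340.9, b"set001.txt", 7548.2, 1192.4),
        (546.7, b"set002.txt", 9039.9, 5546.1),
    ],
    dtype=[("d1", "f8"), ("file", "S20"), ("d2", "f8"), ("d3", "f8")],
)

# Select the numeric fields, then convert them to a plain 2-D float array.
xyz = structured_to_unstructured(records[["d1", "d2", "d3"]])
```

Here `xyz` is an ordinary two-dimensional float64 array with one row per record, which can be passed to `pdist` directly.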
Warren Weckesser
  • That is very clever. I wasn't aware you could do this with numpy. So the view function is just a different way of formatting an existing numpy object without creating a new one? – b10hazard Aug 13 '14 at 13:50
  • Yes; check out http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.view.html – Warren Weckesser Aug 13 '14 at 13:55
  • Thanks. Also, is it possible to view slices of an array? What if I wanted to create a view of the first two elements of the record array and another separate view of the last two elements without creating two new numpy objects? Is that possible? – b10hazard Aug 13 '14 at 14:11
  • Nevermind, I just realized numpy slices don't copy the object like python list slices do. – b10hazard Aug 13 '14 at 14:20
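
The point in the last comment can be checked with a short sketch. The array `data` and the names `first_two` and `last_two` are made up for illustration; basic slices of a NumPy array share memory with the array they were taken from.

```python
import numpy as np

data = np.arange(12.0).reshape(4, 3)

# Basic slicing returns views, not copies.
first_two = data[:2]
last_two = data[-2:]

# Writing through a slice changes the original array as well.
first_two[0, 0] = -1.0

shares_first = np.shares_memory(first_two, data)
shares_last = np.shares_memory(last_two, data)
```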