Convert structured array to numpy array for use with Scikit-Learn

Question

I'm having difficulty converting a structured array loaded from a CSV with np.genfromtxt into a plain np.array so that I can fit the data to a Scikit-Learn estimator. At some point Scikit-Learn casts the structured array to a regular array, which raises ValueError: Can't cast from structure to non-structure. For a long time I used .view to perform the conversion, but this now produces deprecation warnings from NumPy. The code is as follows:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

data = np.genfromtxt(path, dtype=float, delimiter=',', names=True)

target = "occupancy"
features = [
    "temperature", "relative_humidity", "light", "C02", "humidity"
]

# Doesn't work directly
X = data[features]
y = data[target].astype(int)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)

The exception being raised is: ValueError: Can't cast from structure to non-structure, except if the structure only has a single field.
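To see why the cast fails, it helps to inspect what the multi-field selection returns. The sketch below uses a made-up three-row array in place of the CSV (the values and the row count are assumptions for illustration only); the selection is still a one-dimensional structured array, not the two-dimensional float array an estimator expects.

```python
import numpy as np

# Hypothetical stand-in for the CSV data, using the same field layout.
dtype = [("datetime", "<f8"), ("temperature", "<f8"),
         ("relative_humidity", "<f8"), ("light", "<f8"),
         ("C02", "<f8"), ("humidity", "<f8"), ("occupancy", "<f8")]
data = np.array(
    [(np.nan, 23.18, 27.272, 426.0, 721.25, 0.00479299, 1.0),
     (np.nan, 23.15, 27.2675, 429.5, 714.0, 0.00478344, 1.0),
     (np.nan, 21.0, 28.1, 409.0, 1864.0, 0.00432073, 0.0)],
    dtype=dtype,
)

features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = data[features]

print(X.ndim)         # number of dimensions: one record per row
print(X.dtype.names)  # the selection still carries named fields
```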

My second attempt was to use a view as follows:

# View is raising deprecation warnings
X = data[features]
X = X.view((float, len(X.dtype.names)))
y = data[target].astype(int)

Which works and does exactly what I want it to do (I don't need a copy of the data), but results in deprecation warnings:

FutureWarning: Numpy has detected that you may be viewing or writing to 
an array returned by selecting multiple fields in a structured array.

This code may break in numpy 1.15 because this will return a view 
instead of a copy -- see release notes for details.

At the moment we're using tolist() to convert the structured array to a list of tuples and then back to a np.array. This works, but it seems terribly inefficient:

# Current method (efficient?)
X = np.array(data[features].tolist())
y = data[target].astype(int)

There has to be a better way; I'd appreciate any advice.

NOTE: The data for this example is from the UCI ML Occupancy Repository and the data appears as follows:

array([(nan, 23.18, 27.272 , 426.  ,  721.25, 0.00479299, 1.),
       (nan, 23.15, 27.2675, 429.5 ,  714.  , 0.00478344, 1.),
       (nan, 23.15, 27.245 , 426.  ,  713.5 , 0.00477946, 1.), ...,
       (nan, 20.89, 27.745 , 423.5 , 1521.5 , 0.00423682, 1.),
       (nan, 20.89, 28.0225, 418.75, 1632.  , 0.00427949, 1.),
       (nan, 21.  , 28.1   , 409.  , 1864.  , 0.00432073, 1.)],
      dtype=[('datetime', '<f8'), ('temperature', '<f8'), ('relative_humidity', '<f8'), 
             ('light', '<f8'), ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])

Related: [Using np.view() with changes to structured arrays in numpy 1.14](https://stackoverflow.com/q/48267058/190597) and [this github issue](https://github.com/numpy/numpy/issues/10409) — unutbu, Mar 03 '18 at 15:15

score 3 · Accepted Answer · answered Mar 03 '18 at 15:20

Add a .copy() to data[features]:

X = data[features].copy()
X = X.view((float, len(X.dtype.names)))

and the FutureWarning message is gone.

This should be more efficient than converting to a list first.
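For readers on NumPy 1.16 or later: multi-field indexing there returns a view that keeps the original record size, so a plain .copy() may no longer produce a layout that .view can reinterpret. The sketch below shows two alternatives from numpy.lib.recfunctions, structured_to_unstructured and repack_fields. The three-row array is a made-up stand-in for the CSV data, and whether either call copies depends on the NumPy version.

```python
import numpy as np
from numpy.lib import recfunctions as rfn

# Hypothetical stand-in for the CSV data, using the same field layout.
dtype = [("datetime", "<f8"), ("temperature", "<f8"),
         ("relative_humidity", "<f8"), ("light", "<f8"),
         ("C02", "<f8"), ("humidity", "<f8"), ("occupancy", "<f8")]
data = np.array(
    [(np.nan, 23.18, 27.272, 426.0, 721.25, 0.00479299, 1.0),
     (np.nan, 23.15, 27.2675, 429.5, 714.0, 0.00478344, 1.0),
     (np.nan, 21.0, 28.1, 409.0, 1864.0, 0.00432073, 0.0)],
    dtype=dtype,
)
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]

# Option 1: let NumPy build the 2-D float array from the selected fields.
X = rfn.structured_to_unstructured(data[features])

# Option 2: repack the selection into a contiguous layout without padding,
# then reinterpret the bytes as float64 and restore the row structure.
packed = rfn.repack_fields(data[features])
X2 = packed.view(np.float64).reshape(len(data), -1)

y = data["occupancy"].astype(int)
print(X.shape, X2.shape, y.shape)
```

Either X or X2 can then be passed to clf.fit together with y.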

Mike Müller

This feels like the solution we're going to use - but it's a shame to have to make a copy of a potentially large dataset just to use the view on that. – bbengfort Mar 04 '18 at 20:42

unutbu · Answer 2 · 2018-03-04T18:51:49.203

You could avoid the need for copying if you can read the data into a plain NumPy array first (by omitting the names parameter):

data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)

Then (lucky for us), X is composed of all but the first and last columns (i.e. omitting the datetime and occupancy columns). So we can express X and y as slices:

X = data[:, 1:-1]
y = data[:, -1].astype(int)

Then we can pass these to scikit-learn functions easily:

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)

and, if we wish, we can view the plain NumPy array as a structured array afterwards:

features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = X.ravel().view([(field, X.dtype.type) for field in features])

Unfortunately, this workaround relies on X being expressible as a slice; we wouldn't be able to avoid copying if, for instance, occupancy showed up in between the other feature columns. It also means you have to define X using X = data[:, 1:-1] instead of the more readable X = data[features].

Putting it all together:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

data = np.genfromtxt(path, dtype=float, delimiter=',', skip_header=1)

X = data[:, 1:-1]
y = data[:, -1].astype(int)

clf = GradientBoostingClassifier(random_state=42)
clf.fit(X, y)

features = ["temperature", "relative_humidity", "light", "C02", "humidity"]
X = X.ravel().view([(field, X.dtype.type) for field in features])
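One detail worth checking in this approach is where copies happen. The sketch below uses a made-up 3x7 array in place of the CSV data and np.shares_memory to show that the slice X is a view of data, while X.ravel() has to copy because the slice is not contiguous, so the structured result no longer shares memory with data.

```python
import numpy as np

# Hypothetical 3x7 array standing in for the plain array read from the CSV.
data = np.arange(21, dtype=float).reshape(3, 7)
features = ["temperature", "relative_humidity", "light", "C02", "humidity"]

X = data[:, 1:-1]  # basic slicing: a view, no copy

# X is not contiguous in memory, so ravel() returns a copy here.
flat = X.ravel()
structured = flat.view([(field, X.dtype.type) for field in features])

print(np.shares_memory(X, data))           # X is a view of data
print(np.shares_memory(structured, data))  # structured was built from a copy
print(structured["temperature"])           # second column of data
```

So the model fitting step avoids a copy, but converting back to a structured array afterwards does not.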

If you must start with the structured array, then hpaulj's answer shows how to view/reshape/slice the structured array to obtain a plain array without copying:

import numpy as np
nan = np.nan
data = np.array([(nan, 23.18, 27.272 , 426.  ,  721.25, 0.00479299, 1.),
       (nan, 23.15, 27.2675, 429.5 ,  714.  , 0.00478344, 1.),
       (nan, 23.15, 27.245 , 426.  ,  713.5 , 0.00477946, 1.), 
       (nan, 20.89, 27.745 , 423.5 , 1521.5 , 0.00423682, 1.),
       (nan, 20.89, 28.0225, 418.75, 1632.  , 0.00427949, 1.),
       (nan, 21.  , 28.1   , 409.  , 1864.  , 0.00432073, 1.)],
      dtype=[('datetime', '<f8'), ('temperature', '<f8'), ('relative_humidity', '<f8'), 
             ('light', '<f8'), ('C02', '<f8'), ('humidity', '<f8'), ('occupancy', '<f8')])

target = 'occupancy'
nrows = len(data)
X = data.view('<f8').reshape(nrows, -1)[:, 1:-1]
y = data[target].astype(int)

This takes advantage of the fact that every field has the same 8-byte float dtype, so the structured array can be viewed directly as a plain array of dtype <f8. Reshaping makes it a 2D array with the same number of rows, and slicing removes the datetime and occupancy columns (fields) from the array.
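Because the trick only works when all fields share one dtype, it may be worth guarding it. The helper below, fields_as_2d_view, is a hypothetical name introduced for this sketch, not a NumPy function. It assumes a contiguous structured array without padding whose fields are all little-endian float64, and the two-row array is made-up data.

```python
import numpy as np

def fields_as_2d_view(arr):
    """Hypothetical helper: view a structured array whose fields are all
    '<f8' as a 2-D float array, one column per field, without copying."""
    field_dtypes = {arr.dtype[name] for name in arr.dtype.names}
    if field_dtypes != {np.dtype("<f8")}:
        raise TypeError("all fields must be '<f8' to reinterpret the buffer")
    return arr.view("<f8").reshape(len(arr), -1)

dtype = [("datetime", "<f8"), ("temperature", "<f8"),
         ("relative_humidity", "<f8"), ("light", "<f8"),
         ("C02", "<f8"), ("humidity", "<f8"), ("occupancy", "<f8")]
data = np.array(
    [(np.nan, 23.18, 27.272, 426.0, 721.25, 0.00479299, 1.0),
     (np.nan, 21.0, 28.1, 409.0, 1864.0, 0.00432073, 1.0)],
    dtype=dtype,
)

X = fields_as_2d_view(data)[:, 1:-1]  # drop datetime and occupancy
y = data["occupancy"].astype(int)

print(X.shape)
print(np.shares_memory(X, data))  # checks that X is a view of data
```

A structured array with mixed field types (say, an integer occupancy field) would raise TypeError here instead of silently reinterpreting bytes.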

Thanks for your answer! Unfortunately, we do need a solution to handle structured arrays as we don't have any control of the input type; using `np.genfromtxt` was just to illustrate the problem. — bbengfort, Mar 04 '18 at 13:42
In that case, [hpaulj's answer](https://stackoverflow.com/a/48269334/190597) shows how to `view/reshape/slice` the structured array to obtain a plain array. I'll update my post to show how it could be applied in your case. — unutbu, Mar 04 '18 at 14:04
I appreciate all the hard work you've put into the answer. I just tried your solution using view/reshape/slice and unfortunately, I'm still getting the deprecation warnings. I'd prefer not to copy, but this looks like the only thing to do. I appreciate all the effort and thought you put into this though. — bbengfort, Mar 05 '18 at 15:11
