Programmatically add column names to numpy ndarray

Question

I'm trying to add column names to a numpy ndarray, then select columns by their names. But it doesn't work. I can't tell if the problem occurs when I add the names, or later when I try to call them.

Here's my code.

data = np.genfromtxt(csv_file, delimiter=',', dtype=np.float, skip_header=1)

#Add headers
csv_names = [ s.strip('"') for s in file(csv_file,'r').readline().strip().split(',')]
data = data.astype(np.dtype( [(n, 'float64') for n in csv_names] ))

Dimension-based diagnostics match what I expect:

print len(csv_names)
>> 108
print data.shape
>> (1652, 108)

"print data.dtype.names" also returns the expected output.

But when I start calling columns by their field names, screwy things happen. The "column" is still an array with 108 columns...

print data["EDUC"].shape
>> (1652, 108)

... and it appears to contain more missing values than there are rows in the data set.

print np.sum(np.isnan(data["EDUC"]))
>> 27976

Any idea what's going wrong here? Adding headers should be a trivial operation, but I've been fighting this bug for hours. Help!

score 15 · Accepted Answer · edited May 23 '17 at 12:08

The problem is that you are thinking in terms of spreadsheet-like arrays, whereas NumPy uses different concepts.

Here is what you must know about NumPy:

NumPy arrays only contain elements of a single type.
If you need spreadsheet-like "columns", this type must be some tuple-like type. Such arrays are called Structured Arrays, because their elements are structures (i.e. tuples).

In your case, NumPy would thus take your 2-dimensional regular array and produce a one-dimensional array whose type is a 108-element tuple (the spreadsheet array that you are thinking of is 2-dimensional).

These choices were probably made for efficiency reasons: all the elements of an array have the same type and therefore have the same size: they can be accessed, at a low-level, very simply and quickly.
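A small sketch of the difference, using a hypothetical 2x3 array and made-up column names in place of the 1652x108 data from the question. It illustrates why astype() leaves the shape unchanged: each scalar is converted to a record on its own, so nothing is grouped by row.

```python
import numpy as np

# Hypothetical 2x3 stand-in for the 1652x108 array in the question.
data = np.arange(6, dtype=np.float64).reshape(2, 3)
names = ["a", "b", "c"]  # illustrative column names, not from the question
dtype = np.dtype([(n, "float64") for n in names])

# astype() converts every scalar separately, so each of the 2x3 cells
# becomes a 3-field record and the shape stays (2, 3).
cast = data.astype(dtype)
print(cast.shape)       # (2, 3)
print(cast["a"].shape)  # (2, 3): a full 2D array, not a single column

# A structured array holding one record per row is one-dimensional.
records = np.zeros(2, dtype=dtype)
print(records.shape)       # (2,)
print(records["a"].shape)  # (2,)
```

This is consistent with the symptoms in the question: data["EDUC"] had shape (1652, 108), and the NaN count covered the whole array rather than one column.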

Now, as user545424 showed, there is a simple NumPy answer to what you want to do (genfromtxt() accepts a names argument with column names).

If you want to convert your array from a regular NumPy ndarray to a structured array, you can do:

data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))

(you were close: you used astype() instead of view()).
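As a minimal sketch of the view() approach, assuming the array is C-contiguous and already float64 (the small array and the names a, b, c are placeholders for illustration):

```python
import numpy as np

# Placeholder data: 2 rows, 3 columns, contiguous float64.
data = np.arange(6, dtype=np.float64).reshape(2, 3)
names = ["a", "b", "c"]  # illustrative column names

# Reinterpret each row of three floats as one 3-field record.
# view() yields shape (2, 1); reshape() flattens it to (2,).
structured = data.view(dtype=[(n, "float64") for n in names]).reshape(len(data))
print(structured.shape)  # (2,)
print(structured["b"])   # second column: values 1.0 and 4.0

# Both arrays share the same memory, so a write through one
# is visible through the other.
data[0, 1] = 100.0
print(structured["b"][0])  # 100.0

# The original 2D array is still available for matrix operations.
product = data @ data.T
print(product.shape)  # (2, 2)
```

Because no data is copied, the 2D array can be kept for linear algebra while the structured view is used for access by name.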

You can also check the answers to quite a few Stack Overflow questions, including "Converting a 2D numpy array to a structured array" and "how to convert regular numpy array to record array?".

answered May 25 '12 at 08:07

Eric O. Lebigot

Thanks -- this helps clear things up conceptually. But I still have some questions about this particular case. Here, all of my columns are floats, and I'm going to be doing a lot of matrix multiplication, so I want to keep the 2d-array structure -- no need for structured array. All I want to do is add field names. Is that possible? – Abe May 25 '12 at 13:01
NB: genfromtxt imports the csv in numpy's structured tuple format. I tried everything I could think of to import field names in array format, and nothing worked. – Abe May 25 '12 at 13:03
@Abe: You can still perform matrix multiplications: the `view()` is simply another way to look at the *same* data. So, you can work with both the original data array and the `view()`ed array at the same time (the first array is 2D, the second is 1D and structured). – Eric O. Lebigot May 26 '12 at 04:10
@Abe: About your 2nd question: you *cannot* have "field names in (2D) array format". This concept is not valid in NumPy (this is a spreadsheet concept). You want either a non-structured/named-columns 2D array (your `data` array), or a 1D structured/named-columns version of it (the result of `view()` in my answer). I hope this will help clear things up. :) – Eric O. Lebigot May 26 '12 at 04:11
@Abe: Technically, I don't want to make things more complicated than they are, but note that you can have a 2D (or n-dimensional) structured array. However, each cell will contain a *tuple*. Example: `arr = np.zeros((3, 5), dtype=[('x', int), ('y', float)])`, with structure access like `arr['x']`, which returns a 2D array of integers. – Eric O. Lebigot May 26 '12 at 08:07

user545424 · Answer 2 · 2012-05-24T18:21:29.010

3

Unfortunately, I don't know what is going on when you try to add the field names, but I do know that you can build the array you want directly from the file via

data = np.genfromtxt(csv_file, delimiter=',', names=True)
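A short sketch of this route, reading from an in-memory string instead of a real file. The CSV contents and the AGE and INCOME columns are made up for illustration. The last step, which uses structured_to_unstructured() from numpy.lib.recfunctions, is one possible way to get a plain 2D array back when all fields are floats.

```python
import io

import numpy as np
from numpy.lib import recfunctions as rfn

# Made-up CSV content standing in for csv_file.
csv_text = "EDUC,AGE,INCOME\n12,30,40000\n16,45,85000\n"

# names=True takes the field names from the first line.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)
print(data.dtype.names)    # ('EDUC', 'AGE', 'INCOME')
print(data.shape)          # (2,): one record per row
print(data["EDUC"].shape)  # (2,): a real single column

# Convert to an ordinary 2D float array for matrix operations.
plain = rfn.structured_to_unstructured(data)
print(plain.shape)  # (2, 3)
```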

EDIT:

It seems like adding field names only works when the input is a list of tuples:

data = np.array(list(map(tuple, data)), dtype=[(n, 'float64') for n in csv_names])
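For illustration, here is the same idea on a tiny hypothetical data set (the rows and the names a, b, c are invented). Unlike view(), this builds a new array and copies the values.

```python
import numpy as np

csv_names = ["a", "b", "c"]                # illustrative names
rows = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]  # invented data, one list per row
dtype = [(n, "float64") for n in csv_names]

# Each tuple is treated as one record, so the result is a
# one-dimensional structured array with one element per row.
records = np.array([tuple(r) for r in rows], dtype=dtype)
print(records.shape)       # (2,)
print(records["c"].shape)  # (2,)
print(records["c"][1])     # 5.0
```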

edited May 24 '12 at 18:21

answered May 24 '12 at 18:15

user545424

So is it the case that ndarrays can be referenced by field names if they are cast as tuples OR referenced by field id when cast as arrays---but never both? That seems to be the way it works, but I don't see anything like that in the documentation. – Abe May 24 '12 at 18:36
I'm starting to wonder if this is a bug. It's very strange behavior to have the array constructor act differently based on the type of the nested structure you pass in. – user545424 May 24 '12 at 18:51
@user545424: You can understand this behavior if you know the principles on which NumPy is based (you can for instance check my answer). In a nutshell: tuple() is a kind of "fundamental type" (like floats), for NumPy (so you get a kind of structured array, when you pass tuples), whereas passing lists or arrays as input means "add another dimension" to the array (you get an array of numbers, typically). – Eric O. Lebigot May 26 '12 at 04:13
