14

I'm trying to add column names to a numpy ndarray, then select columns by their names. But it doesn't work. I can't tell if the problem occurs when I add the names, or later when I try to call them.

Here's my code.

data = np.genfromtxt(csv_file, delimiter=',', dtype=np.float, skip_header=1)

#Add headers
csv_names = [ s.strip('"') for s in file(csv_file,'r').readline().strip().split(',')]
data = data.astype(np.dtype( [(n, 'float64') for n in csv_names] ))

Dimension-based diagnostics match what I expect:

print len(csv_names)
>> 108
print data.shape
>> (1652, 108)

"print data.dtype.names" also returns the expected output.

But when I start calling columns by their field names, screwy things happen. The "column" is still an array with 108 columns...

print data["EDUC"].shape
>> (1652, 108)

... and it appears to contain more missing values than there are rows in the data set.

print np.sum(np.isnan(data["EDUC"]))
>> 27976

Any idea what's going wrong here? Adding headers should be a trivial operation, but I've been fighting this bug for hours. Help!

Saullo G. P. Castro
Abe

2 Answers

15

The problem is that you are thinking in terms of spreadsheet-like arrays, whereas NumPy uses different concepts.

Here is what you must know about NumPy:

  1. NumPy arrays only contain elements of a single type.
  2. If you need spreadsheet-like "columns", this type must be some tuple-like type. Such arrays are called Structured Arrays, because their elements are structures (i.e. tuples).

In your case, NumPy would thus take your 2-dimensional regular array and produce a one-dimensional array whose type is a 108-element tuple (the spreadsheet array that you are thinking of is 2-dimensional).
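To make the difference concrete, here is a minimal sketch contrasting a regular 2D array with a structured array holding the same rows. The column names and values are made up for illustration and are not taken from the asker's file.

```python
import numpy as np

# Hypothetical column names and values, chosen only for illustration.
names = ["EDUC", "AGE", "INCOME"]
regular = np.array([[12.0, 30.0, 41000.0],
                    [16.0, 45.0, 68000.0]])

# A structured array holding the same rows: one record (tuple) per row.
structured = np.array([tuple(row) for row in regular],
                      dtype=[(n, "float64") for n in names])

print(regular.shape)      # (2, 3): rows x columns
print(structured.shape)   # (2,): one element per row
print(structured["AGE"])  # [30. 45.]
```

The structured array is one-dimensional: the "columns" live inside the element type rather than along a second axis.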

These choices were probably made for efficiency reasons: all the elements of an array have the same type and therefore have the same size: they can be accessed, at a low-level, very simply and quickly.

Now, as user545424 showed, there is a simple NumPy answer to what you want to do (genfromtxt() accepts a names argument with column names).

If you want to convert your array from a regular NumPy ndarray to a structured array, you can do:

data.view(dtype=[(n, 'float64') for n in csv_names]).reshape(len(data))

(you were close: you used astype() instead of view()).
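As a rough illustration of why the two calls behave so differently, the sketch below runs both on a small made-up array (the names and values are assumptions, not the asker's data). It suggests that astype() turns every single float into a full record, which would explain the (1652, 108) shape and the inflated NaN count in the question, while view() reinterprets each row as one record.

```python
import numpy as np

names = ["EDUC", "AGE"]  # hypothetical column names
dtype = [(n, "float64") for n in names]
data = np.array([[12.0, 30.0],
                 [16.0, 45.0],
                 [np.nan, 52.0]])

# astype() converts each float into a whole record, so the 2D shape survives
# and every field of a record holds a copy of that one float.
cast = data.astype(dtype)
print(cast.shape)          # (3, 2)
print(cast["EDUC"].shape)  # (3, 2)

# view() reinterprets the bytes of each row as one record; nothing is copied.
records = data.view(dtype=dtype).reshape(len(data))
print(records.shape)       # (3,)
print(records["AGE"])      # [30. 45. 52.]

# Both arrays share memory, so the 2D array stays available for linear algebra
# while the structured view gives access by name.
data[0, 1] = 31.0
print(records["AGE"][0])   # 31.0
```

Note that view() relies on the 2D array being contiguous in memory with a float64 element type, as it is here.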

You can also check the answers to quite a few Stack Overflow questions, including "Converting a 2D numpy array to a structured array" and "How to convert regular numpy array to record array?".

Community
Eric O. Lebigot
  • Thanks -- this helps clear things up conceptually. But I still have some questions about this particular case. Here, all of my columns are floats, and I'm going to be doing a lot of matrix multiplication, so I want to keep the 2d-array structure -- no need for structured array. All I want to do is add field names. Is that possible? – Abe May 25 '12 at 13:01
  • NB: genfromtxt imports the csv in numpy's structured tuple format. I tried everything I could think of to import field names in array format, and nothing worked. – Abe May 25 '12 at 13:03
  • @Abe: You can still perform matrix multiplications: the `view()` is simply another way to look at the *same* data. So, you can work with both the original data array and the `view()`ed array at the same time (the first array is 2D, the second is 1D and structured). – Eric O. Lebigot May 26 '12 at 04:10
  • @Abe: About your 2nd question: you *cannot* have "field names in (2D) array format". This concept is not valid in NumPy (this is a spreadsheet concept). You want either a non-structured/named-columns 2D array (your `data` array), or a 1D structured/named-columns version of it (the result of `view()` in my answer). I hope this will help clear things up. :) – Eric O. Lebigot May 26 '12 at 04:11
  • @Abe: Technically, I don't want to make things more complicated than they are, but note that you can have a 2D (or n-dimensional) structured array. However, each cell will contain a *tuple*. Example: `arr = np.zeros((3, 5), dtype=[('x', int), ('y', float)])`, with structure access like `arr['x']`, which returns a 2D array of integers. – Eric O. Lebigot May 26 '12 at 08:07
3

Unfortunately, I don't know what is going on when you try to add the field names, but I do know that you can build the array you want directly from the file via

data = np.genfromtxt(csv_file, delimiter=',', names=True)
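For a quick check of what this returns, here is a small self-contained sketch. It reads from an in-memory string instead of a real file; the CSV content and column names are invented for the example.

```python
import io
import numpy as np

# Hypothetical CSV content standing in for csv_file.
csv_text = "EDUC,AGE\n12,30\n16,45\n"

# names=True takes the field names from the first line of the input.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", names=True)

print(data.dtype.names)  # ('EDUC', 'AGE')
print(data.shape)        # (2,)
print(data["EDUC"])      # [12. 16.]
```

The result is a one-dimensional structured array, so `data["EDUC"]` is a plain 1D column.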

EDIT:

It seems like adding field names only works when the input is a list of tuples:

data = np.array(list(map(tuple, data)), [(n, 'float64') for n in csv_names])
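The difference between passing tuples and passing lists can be seen on a tiny example. This is only a sketch with made-up field names and values:

```python
import numpy as np

dtype = [("EDUC", "float64"), ("AGE", "float64")]  # hypothetical field names
rows = [[12.0, 30.0], [16.0, 45.0]]

# Tuples are read as records: one structured element per row.
from_tuples = np.array([tuple(r) for r in rows], dtype=dtype)

# Nested lists are read as an extra dimension: every float becomes its own record.
from_lists = np.array(rows, dtype=dtype)

print(from_tuples.shape)    # (2,)
print(from_tuples["EDUC"])  # [12. 16.]
print(from_lists.shape)     # (2, 2)
```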
user545424
  • So is it the case that ndarrays can be referenced by field names if they are cast as tuples OR referenced by field id when cast as arrays---but never both? That seems to be the way it works, but I don't see anything like that in the documentation. – Abe May 24 '12 at 18:36
  • I'm starting to wonder if this is a bug. It's very strange behavior to have the array constructor act differently based on the type of the nested structure you pass in. – user545424 May 24 '12 at 18:51
  • @user545424: You can understand this behavior if you know the principles on which NumPy is based (you can for instance check my answer). In a nutshell: tuple() is a kind of "fundamental type" (like floats), for NumPy (so you get a kind of structured array, when you pass tuples), whereas passing lists or arrays as input means "add another dimension" to the array (you get an array of numbers, typically). – Eric O. Lebigot May 26 '12 at 04:13