
I wonder what the best way of normalizing/standardizing a numpy recarray is. To make it clear, I'm not talking about a mathematical matrix, but a record array that also has e.g. textual columns (such as labels).

a = np.genfromtxt("iris.csv", delimiter=",", dtype=None)
print(a.shape)
> (150,)

As you can see, I cannot simply process a[:,:-1], because the array is one-dimensional.

The best I found is to iterate over all columns:

for nam in a.dtype.names[:-1]:
    col = a[nam]
    a[nam] = (col - col.min()) / (col.max() - col.min())

Any more elegant way of doing this? Is there some method such as "normalize" or "standardize" somewhere?

jamylak
Has QUIT--Anony-Mousse

2 Answers


There are a number of ways to do it, but some are cleaner than others.

Usually, in numpy, you keep the string data in a separate array.

(Things are a bit more low-level than, say, R's data frame. You typically just wrap things up in a class for the association, but keep different data types separate.)
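To illustrate the "wrap things up in a class" idea, here is a minimal sketch. The class name `LabeledData`, its method, and the sample rows are all hypothetical, not part of numpy; it only shows the pattern of keeping the numeric block and the labels in separate, parallel arrays.

```python
import numpy as np


class LabeledData:
    """Hypothetical container: a 2D float array plus a parallel label array."""

    def __init__(self, values, labels):
        self.values = np.asarray(values, dtype=float)
        self.labels = np.asarray(labels)
        if self.values.shape[0] != self.labels.shape[0]:
            raise ValueError("values and labels must have the same number of rows")

    def min_max_normalized(self):
        """Return a new LabeledData with each column rescaled to [0, 1]."""
        lo = self.values.min(axis=0)
        span = self.values.max(axis=0) - lo
        return LabeledData((self.values - lo) / span, self.labels)


# Made-up sample rows, not the real iris data
sample = LabeledData(
    [[5.0, 3.0], [6.0, 2.0], [7.0, 4.0]],
    ["setosa", "versicolor", "virginica"],
)
scaled = sample.min_max_normalized()
```

Because the numeric part is a plain 2D array, column-wise operations such as `min(axis=0)` work directly. Note that this sketch does not guard against constant columns, where the span would be zero.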

Honestly, numpy isn't optimized for handling "flexible" datatypes such as this (though it can certainly do it). Things like pandas provide a better interface for "spreadsheet-like" data (and pandas is just a layer on top of numpy).

However, structured arrays (which is what you have here) will allow you to slice them column-wise when you pass in a list of field names, e.g. data[['col1', 'col2', 'col3']].
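As a small illustration of selecting fields by name, the sketch below builds a tiny structured array by hand. The field names and values are made up for the example and are not taken from your file.

```python
import numpy as np

# Hypothetical structured array with two float fields and one text field
data = np.array(
    [(5.1, 3.5, "setosa"), (7.0, 3.2, "versicolor")],
    dtype=[("sepal_length", "f8"), ("sepal_width", "f8"), ("species", "U10")],
)

# A single name returns one column as an ordinary 1D array
lengths = data["sepal_length"]

# A list of names returns a structured array restricted to those fields
numeric = data[["sepal_length", "sepal_width"]]

print(lengths.dtype)        # float64
print(numeric.dtype.names)  # ('sepal_length', 'sepal_width')
print(numeric.shape)        # (2,)
```

The multi-field result is still one-dimensional and structured, so it needs a further conversion step before it behaves like a regular 2D float array.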

At any rate, one way is to do something like this:

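import numpy as np
from numpy.lib import recfunctions as rfn

# Same loading call as in the question; encoding=None reads the labels as str
data = np.genfromtxt('iris.csv', delimiter=',', dtype=None, encoding=None)

# In this case, it's just all but the last, but we could be more general
# This must be a list and not a tuple, though.
float_fields = list(data.dtype.names[:-1])

float_dat = data[float_fields]

# Now we just need to convert it to a "regular" 2D array...
# (np.float and a plain .view() on a multi-field selection no longer work
# in current NumPy; structured_to_unstructured needs NumPy >= 1.16)
float_dat = rfn.structured_to_unstructured(float_dat, dtype=np.float64)

# And we can normalize columns as usual.
normalized = (float_dat - float_dat.min(axis=0)) / np.ptp(float_dat, axis=0)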

However, this is far from ideal. If you want to do the operation in-place (as you currently are) the easiest solution is what you already have: Just iterate over the field names.
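If you end up keeping the loop, it can be wrapped in a small helper. This is only a sketch: the function name `scale_fields_inplace` and the `method` argument are my own invention, it assumes the named fields are floating point (integer fields would truncate the results), and the sample rows are made up. The "zscore" branch covers the "standardize" case from the question.

```python
import numpy as np


def scale_fields_inplace(arr, fields, method="minmax"):
    """Rescale the named float fields of a structured array in place."""
    for name in fields:
        col = arr[name]  # a view into arr; the assignment below writes back
        if method == "minmax":
            arr[name] = (col - col.min()) / (col.max() - col.min())
        elif method == "zscore":
            arr[name] = (col - col.mean()) / col.std()
        else:
            raise ValueError("unknown method: %r" % method)


# Made-up rows for illustration
a = np.array(
    [(1.0, 10.0, "x"), (2.0, 20.0, "y"), (3.0, 40.0, "z")],
    dtype=[("f0", "f8"), ("f1", "f8"), ("label", "U1")],
)
b = a.copy()

scale_fields_inplace(a, a.dtype.names[:-1])                   # min-max to [0, 1]
scale_fields_inplace(b, b.dtype.names[:-1], method="zscore")  # zero mean, unit std
```

The text column is never touched, since only the field names you pass in are processed.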

Incidentally, using pandas, you'd do something like this:

import pandas
data = pandas.read_csv('iris.csv', header=None)

float_dat = data[data.columns[:-1]]
dmin, dmax = float_dat.min(axis=0), float_dat.max(axis=0)

data[data.columns[:-1]] = (float_dat - dmin) / (dmax - dmin)
Joe Kington
  • +1 Thank you. This is a very informative and insightful answer. Splitting the dataset into numerical and non-numerical columns is probably the way to go. This makes many other operations well-defined and is in fact what I was trying to do. I wasn't aware of the option of using `data[list]` to select multiple columns. – Has QUIT--Anony-Mousse Mar 20 '12 at 07:02

What version of NumPy are you using? With version 1.5.1, I don't get this behavior. I made a short text file as an example, saved as test.txt:

last,first,country,state,zip
tyson,mike,USA,Nevada,89146
brady,tom,USA,Massachusetts,02035

When I then execute the following code, this is what I get:

>>> import numpy as np
>>> a = np.genfromtxt("/home/ely/Desktop/Python/test.txt",delimiter=',',dtype=None)
>>> print a.shape
(3, 5)
>>> print a
[['last' 'first' 'country' 'state' 'zip']
 ['tyson' 'mike' 'USA' 'Nevada' '89146']
 ['brady' 'tom' 'USA' 'Massachusetts' '02035']]
>>> print a[0,:-1]
['last' 'first' 'country' 'state']
>>> print a.dtype.names
None

I'm just wondering what's different about your data.

ely
  • Note: this was meant as a comment, not an answer... just needed more room to put in the example above. – ely Mar 19 '12 at 18:59
  • What's different is that you're getting a string array, not a structured array. Have a look at the dtype of `a` in your example. – Joe Kington Mar 19 '12 at 19:07
  • Sure, but what causes the incoming array to be 'structured'? If it's just a csv file, won't `genfromtxt()` always produce a string array? – ely Mar 19 '12 at 19:10
  • Nope. In your case, it's only creating a string array because the first row (the column names) is all strings. If you had numbers in any column in the first row, you'd get a structured array. In your case, if you specify `names=True`, the column names will be read from the first row, and you'll get a structured array. (See [here](http://docs.scipy.org/doc/numpy/user/basics.rec.html) for what I mean by "structured", by the way) – Joe Kington Mar 19 '12 at 19:22
  • @JoeKington probably got the difference. I have four purely numerical columns that I want to standardize, and one that has the labels. – Has QUIT--Anony-Mousse Mar 20 '12 at 07:01
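The string-array versus structured-array difference discussed in these comments can be reproduced with a small in-memory file. This is a sketch with made-up contents (the column names and rows are hypothetical), using `io.StringIO` in place of a file on disk.

```python
import io
import numpy as np

# Hypothetical in-memory stand-in for a CSV file with a header row
text = "length,width,species\n5.1,3.5,setosa\n7.0,3.2,versicolor\n"

# Without names=True, the header row forces every column to be read as text,
# so the result is an ordinary 2D string array
plain = np.genfromtxt(io.StringIO(text), delimiter=",", dtype=None, encoding="utf-8")

# With names=True, the header supplies the field names and each column gets
# its own type, so the result is a 1D structured array
structured = np.genfromtxt(
    io.StringIO(text), delimiter=",", dtype=None, names=True, encoding="utf-8"
)

print(plain.shape, plain.dtype.names)
print(structured.shape, structured.dtype.names)
```

In the first case `dtype.names` is `None` and 2D slicing works; in the second the array has one entry per row and columns are accessed by field name.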