Calling preprocessing.scale on a heterogeneous array

Question

I get the TypeError shown below. I have checked my df and it contains only numbers. Could the error be caused by the conversion to a NumPy array? After the conversion, the array has items like

[Timestamp('1993-02-11 00:00:00') 28.1216 28.3374 ...]

Any suggestions on how to solve this, please?

df:
          Date     Open     High      Low    Close  Volume
9   1993-02-11  28.1216  28.3374  28.1216  28.2197   19500
10  1993-02-12  28.1804  28.1804  28.0038  28.0038   42500
11  1993-02-16  27.9253  27.9253  27.2581  27.2974  374800
12  1993-02-17  27.2974  27.3366  27.1796  27.2777  210900

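import numpy as np
from sklearn import preprocessing
X = np.array(df.drop(columns=['High']))  # the 'Date' column is still included here
X = preprocessing.scale(X)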

TypeError: float() argument must be a string or a number

Well, the error is self-explanatory: it can't deal with datetime-objects or whatever your dates are. — sascha, Oct 18 '17 at 15:28

score 0 · Answer 1 · answered Oct 21 '17 at 00:32

While you're saying that your dataframe "all contains numbers only", you also note that the first column consists of datetime objects. The error is telling you that preprocessing.scale only wants to work with float values.

The real question, however, is what you expect to happen in the first place. preprocessing.scale centers each column on its mean and scales it to unit variance, so that all measured quantities are represented on roughly the same footing. Now, your first column tells you what dates your data correspond to, while the rest of the columns are the numeric data themselves. Why would you want to normalize the dates? How would you normalize the dates?
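To make the centering and scaling concrete, here is a minimal sketch using only the numeric columns from the sample rows in the question (Open, Low, Close, Volume). The hand-written version is my own illustration of the same operation, not code from the question, and it assumes the default behaviour of preprocessing.scale (per-column, population standard deviation).

```python
import numpy as np
from sklearn import preprocessing

# Numeric columns only (Open, Low, Close, Volume), copied from the sample rows
X = np.array([
    [28.1216, 28.1216, 28.2197, 19500],
    [28.1804, 28.0038, 28.0038, 42500],
    [27.9253, 27.2581, 27.2974, 374800],
    [27.2974, 27.1796, 27.2777, 210900],
], dtype=float)

X_scaled = preprocessing.scale(X)

# The same operation by hand: subtract each column's mean,
# then divide by each column's standard deviation (ddof=0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0))  # each column mean is approximately 0
print(X_scaled.std(axis=0))   # each column standard deviation is approximately 1
```

After scaling, Volume (tens of thousands) and the prices (around 28) live on comparable scales, which is the whole point of the call. None of this has an obvious meaning for a Timestamp.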

Semantically speaking, I believe you should leave your dates alone. Whatever post-processing you're planning to perform on your numerical data, the normalized data should still be parameterized by the original dates. If you want to process your dates too, you need to come up with an explicit way to convert them to something numeric (say, elapsed time from a given date in given units).
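If you do decide to turn the dates into numbers, one possible sketch is to measure elapsed days from the first date. The reference date and the unit are arbitrary choices made here for illustration, and the variable names are hypothetical.

```python
import pandas as pd

# The four dates from the sample rows in the question
dates = pd.to_datetime(
    pd.Series(["1993-02-11", "1993-02-12", "1993-02-16", "1993-02-17"])
)

# Elapsed whole days since the first date (assumed reference point)
elapsed_days = (dates - dates.iloc[0]).dt.days

print(elapsed_days.tolist())  # [0, 1, 5, 6]
```

A column like this is plain integers, so it could be passed to preprocessing.scale, but whether scaling it makes sense still depends on what you plan to do with the result.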

So I believe you should drop your dates from your processing round altogether, and start with

X = df.drop(columns=['Date', 'High']).to_numpy()
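Put together, a self-contained sketch of this approach might look like the following. The DataFrame is rebuilt from the sample rows in the question, and keeping the dates as the index of the scaled result is my own suggestion for preserving the link between rows and dates, not something the question requires.

```python
import pandas as pd
from sklearn import preprocessing

# Rebuild the sample rows from the question
df = pd.DataFrame({
    "Date": pd.to_datetime(["1993-02-11", "1993-02-12", "1993-02-16", "1993-02-17"]),
    "Open": [28.1216, 28.1804, 27.9253, 27.2974],
    "High": [28.3374, 28.1804, 27.9253, 27.3366],
    "Low": [28.1216, 28.0038, 27.2581, 27.1796],
    "Close": [28.2197, 28.0038, 27.2974, 27.2777],
    "Volume": [19500, 42500, 374800, 210900],
}, index=[9, 10, 11, 12])

# Drop the non-numeric Date column along with High before scaling
features = df.drop(columns=["Date", "High"])
X = preprocessing.scale(features.to_numpy(dtype=float))

# Keep the scaled values parameterized by the original dates
scaled = pd.DataFrame(X, columns=features.columns, index=df["Date"])
print(scaled)
```

Because Date is used only as the index, each scaled row can still be traced back to its original date.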
