I need to convert a large dataframe to a NumPy array, preserving only numerical values and types. I know there are well-documented ways to do so.

So, which one is preferable?

df.values
df.as_matrix()
pd.to_numeric(df)
... others ...

Decision factors:

  • efficiency

  • safe handling of NaN (np.nan) and other possibly unexpected values

  • numerical stability

jpp
00__00__00
  • 2
    Posters seem to have the most problems when the dataframe contains mixed items and the dtype for a column, or for the frame as a whole, is `object`. It seems that pandas readily switches to `object` to accommodate strings and `nan` (floats). `numpy`, on the other hand, uses `object` to handle sublists of varying size. – hpaulj Mar 08 '18 at 19:01

2 Answers

15

The functions you mention serve different purposes.

  1. pd.to_numeric: Use this to convert types in your dataframe if your data is not currently stored in numeric form, or if you wish to cast to an optimal type via downcast='float' or downcast='integer'. It accepts a Series, list, or 1-D array rather than a whole DataFrame, so apply it column by column.

  2. pd.DataFrame.to_numpy() (v0.24+) or pd.DataFrame.values: Use this to retrieve the NumPy array representation of your dataframe.

  3. pd.DataFrame.as_matrix: Do not use this. It was kept only for backwards compatibility, was deprecated in pandas 0.23, and was removed in pandas 1.0.
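As a rough sketch of how the first two options combine, the snippet below cleans a small frame and then extracts the array. The frame and its column names are made up for illustration, and `errors="coerce"` is only one possible policy for unexpected values (it silently turns them into NaN).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: a clean integer column, a column stored as strings
# with one unparseable entry, and a float column with a missing value.
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["4.5", "oops", "6.5"],
    "c": [0.1, np.nan, 0.3],
})

# pd.to_numeric works on one column (Series) at a time, so apply it per column.
# errors="coerce" replaces anything unparseable with NaN instead of raising.
numeric = df.apply(pd.to_numeric, errors="coerce")

# Requesting a dtype makes the result explicit instead of inferred.
arr = numeric.to_numpy(dtype="float64")

print(arr.dtype, arr.shape)
print(np.isnan(arr).sum(), "NaN values after coercion")
```

If silently coercing bad values is not acceptable, the default `errors="raise"` makes the conversion fail loudly instead.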

joelostblom
jpp
7

Under the hood, a pandas.DataFrame is not much more than a wrapper around NumPy arrays. The simplest and possibly fastest way is to use pandas.DataFrame.values. From its documentation:

DataFrame.values

Numpy representation of NDFrame

Notes

The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.

E.g., if the dtypes are float16 and float32, the dtype will be upcast to float32. If the dtypes are int32 and uint8, the dtype will be upcast to int32. By numpy.find_common_type convention, mixing int64 and uint64 will result in a float64 dtype.
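The upcasting rules quoted above can be checked with a small sketch like the following. `common_dtype` is a hypothetical helper written for this illustration, not a pandas function, and the int64/uint64 case is shown through NumPy's own promotion rule (`np.result_type`) rather than through pandas.

```python
import numpy as np
import pandas as pd

def common_dtype(*dtypes):
    """Build a one-row frame with one column per dtype and return the
    dtype of the array that DataFrame.to_numpy() produces for it."""
    df = pd.DataFrame(
        {f"c{i}": pd.Series([1], dtype=dt) for i, dt in enumerate(dtypes)}
    )
    return df.to_numpy().dtype

float_mix = common_dtype("float16", "float32")   # upcast to the wider float
int_mix = common_dtype("int32", "uint8")         # int32 can hold every uint8

# NumPy's promotion for int64 + uint64: no integer type holds both ranges,
# so the common type is a float.
wide = np.result_type(np.int64, np.uint64)

# That promotion can lose precision: 2**63 + 1 has no exact float64 form.
big = np.uint64(2**63 + 1)
round_trip = int(np.float64(big))

# A single non-numeric column pushes the whole array to object dtype.
mixed = pd.DataFrame({"x": [1], "s": ["a"]}).to_numpy().dtype

print(float_mix, int_mix, wide, mixed)
print(round_trip == 2**63 + 1)
```

For large frames, checking `df.dtypes` before converting is a cheap way to spot an unwanted upcast to float64 or object.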

ascripter
  • 1
    The clarification about dtypes is exactly the kind of warning I was looking for. – 00__00__00 Mar 08 '18 at 18:32
  • 2
    The pandas 0.24 documentation says not to use .values anymore, and to use .array or .to_numpy() instead. See https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html: "In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we recommend avoiding .values and using .array or .to_numpy()." – pauljohn32 Mar 18 '19 at 18:45
  • @pauljohn32 Could you make it an answer? Very useful. – 00__00__00 Nov 13 '19 at 16:50