15

According to this post, I should be able to access the names of columns in an ndarray as `a.dtype.names`.

However, if I convert a pandas DataFrame to an ndarray with `df.as_matrix()` or `df.values`, then the `dtype.names` field is `None`. Additionally, if I try to assign column names to the ndarray:

X = pd.DataFrame(dict(age=[40., 50., 60.], sys_blood_pressure=[140.,150.,160.]))
print X
print type(X.as_matrix())  # <type 'numpy.ndarray'>
print type(X.as_matrix()[0]) # <type 'numpy.ndarray'>

m = X.as_matrix()
m.dtype.names = list(X.columns)

I get

ValueError: there are no fields defined
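(From what I can tell, `dtype.names` is only populated for structured dtypes; a plain uniform array has no fields at all. A minimal check, with throwaway field names:)

import numpy as np

plain = np.zeros(3, dtype=float)                   # uniform dtype
print plain.dtype.names                            # None -- no fields defined

structured = np.zeros(3, dtype=[('age', float), ('sbp', float)])
print structured.dtype.names                       # ('age', 'sbp')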

UPDATE:

I'm particularly interested in the cases where the matrix only needs to hold a single type (it is an ndarray of a specific numeric type), since I'd also like to use cython for optimization. (I suspect numpy records and structured arrays are more difficult to deal with since they're more freely typed.)

Really, I'd just like to maintain the column-name metadata for arrays passed through a deep tree of scikit-learn predictors. Its `.fit(X, y)` and `.predict(X)` interface doesn't permit passing additional metadata about the column labels outside of the `X` and `y` objects.

user48956
  • `X.as_matrix()` is probably producing a uniform array, all int or float. Especially if all columns have the same type. `dtype.names` as described in the link applies to a structured array, one with a compound `dtype`. Does pandas have anything about creating a structured array? – hpaulj Nov 11 '16 at 19:04
  • what are you going to do with those column names? Your question looks like a ["XY problem"](http://meta.stackexchange.com/a/66378)... – MaxU - stand with Ukraine Nov 11 '16 at 20:09
  • You should show `X` (or at least a portion), as well as `X.as_matrix().shape` and `X.as_matrix().dtype`. – hpaulj Nov 11 '16 at 21:18
  • @MaxU - I'd like to track column names passed as input to scikit predictors. Some predictors filter the data by removing some columns -- it's helpful to be able to track the column names. (For example I might like to visualize a decision tree deeply nested in a set of predictors. What does column 3 represent?) – user48956 Nov 11 '16 at 21:47
  • Plus, I'd like to use numpy (vs pandas) for a variety of performance reasons (e.g. easy use of cython) -- it's easier to keep the data as ndarrays, except that the interfaces to scikit (`.fit(X, y)`, `.predict(X)`) don't permit passing additional column-name metadata that's not in the X or y objects. – user48956 Nov 11 '16 at 21:51
  • If `X` is supposed to be a 2d float array, then you have to get names from some other pandas method. – hpaulj Nov 11 '16 at 23:10
  • What's wrong with using list(X.columns)? – user48956 Nov 11 '16 at 23:12

5 Answers

8

Consider a DF as shown below:

X = pd.DataFrame(dict(one=['Strawberry', 'Fields', 'Forever'], two=[1,2,3]))
X

          one  two
0  Strawberry    1
1      Fields    2
2     Forever    3

Provide a list of tuples as data input to the structured array:

arr_ip = [tuple(i) for i in X.as_matrix()]  # on newer pandas, use X.values or X.to_numpy()

Ordered list of field names:

dtyp = np.dtype(list(zip(X.dtypes.index, X.dtypes)))

Here, `X.dtypes.index` gives you the column names and `X.dtypes` their corresponding dtypes; zipping them produces the list of `(name, dtype)` tuples that `np.dtype` needs to construct the structured dtype.
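For illustration (not part of the original answer), the constructed dtype then looks like this -- the `'<i8'` assumes a 64-bit platform:

dtyp
# dtype([('one', 'O'), ('two', '<i8')])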

arr = np.array(arr_ip, dtype=dtyp)

gives:

arr
# array([('Strawberry', 1), ('Fields', 2), ('Forever', 3)], 
#       dtype=[('one', 'O'), ('two', '<i8')])

and

arr.dtype.names
# ('one', 'two')
Nickil Maveli
  • Cool. Thanks. But what's up with this: `type(arr[0])` gives `numpy.void`? – user48956 Nov 11 '16 at 22:09
  • Doing `arr[0]` gives you `('Strawberry', 1)`. As you can see, they form a tuple of a combination of `dtypes`, namely `str` and `np.int64` respectively. `np.void` basically means that these data types do not fall under pre-defined types such as *int/float/bool/str/cfloat*, but form a collection instead whose type must be distinguished too. Hence, these are sometimes referred to as flexible/generic data types. – Nickil Maveli Nov 11 '16 at 22:27
  • Wonderful! It works like a charm. However, I do get the note "FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead." Using `.values` does not, however, give the desired result. – IceQueeny Oct 27 '18 at 17:03
7

Pandas DataFrames also have a handy `to_records` method. Demo:

X = pd.DataFrame(dict(age=[40., 50., 60.], 
                      sys_blood_pressure=[140.,150.,160.]))
m = X.to_records(index=False)
print repr(m)

Returns:

rec.array([(40.0, 140.0), (50.0, 150.0), (60.0, 160.0)], 
          dtype=[('age', '<f8'), ('sys_blood_pressure', '<f8')])

This is a "record array", which is an ndarray subclass that allows field access using attributes, e.g. m.age in addition to m['age'].

You can pass this to a cython function as a regular float array by constructing a view:

m_float = m.view(float).reshape(m.shape + (-1,))
print repr(m_float)

Which gives:

rec.array([[  40.,  140.],
           [  50.,  150.],
           [  60.,  160.]], 
          dtype=float64)

Note that in order for this to work, the original DataFrame must have a float dtype for every column. To make sure it does, use `m = X.astype(float, copy=False).to_records(index=False)`.
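As a sanity check (my addition), the view shares memory with the record array, so no data is copied -- assuming the reshape above did not force a copy:

m_float[0, 0] = 45.
print m.age[0]   # 45.0 -- the record array sees the change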

user7138814
  • It is worth noting that `to_records` also has an `index` argument (`True` by default, which I'd suggest setting to `False` for most use cases), as well as a handy `column_dtypes = {'col_name': 'float64'}` argument for pre-normalizing the outputted array. – S0AndS0 Mar 19 '19 at 01:01
  • I really like this answer in addition to mine. This other overflow post is relevant when comparing structured arrays to record arrays: https://stackoverflow.com/questions/27995110/numpy-record-array-or-structured-array-or-recarray – D A Mar 12 '21 at 21:42
2

Yet more methods of converting a pandas.DataFrame to numpy.array while preserving label/column names

This is mainly for demonstrating how to set `dtype`/`column_dtypes`, because sometimes a data source iterator's output will need some pre-normalization.


Method one inserts values column by column into a zeroed array of predefined height, and is loosely based on a Creating Structured Arrays guide that a bit of web-crawling turned up.

import numpy


def to_tensor(dataframe, columns = [], dtypes = {}):
    # Use all columns from data frame if none were listed when called
    if len(columns) <= 0:
        columns = list(dataframe.columns)
    # Build list of dtypes to use, updating from any `dtypes` passed when called
    dtype_list = []
    for column in columns:
        if column not in dtypes:
            dtype_list.append(dataframe[column].dtype)
        else:
            dtype_list.append(dtypes[column])
    # Build dictionary with lists of column names and formats in the same order
    dtype_dict = {
        'names': columns,
        'formats': dtype_list
    }
    # Initialize _mostly_ empty numpy array with column names and formats
    numpy_buffer = numpy.zeros(
        shape = len(dataframe),
        dtype = dtype_dict)
    # Insert values from dataframe columns into the named numpy fields
    for column in columns:
        numpy_buffer[column] = dataframe[column].to_numpy()
    # Return results of conversion
    return numpy_buffer

Method two is based on user7138814's answer and will likely be more efficient, as it is basically a wrapper for the built-in `to_records` method available to pandas.DataFrames.

def to_tensor(dataframe, columns = [], dtypes = {}, index = False):
    to_records_kwargs = {'index': index}
    if not columns:  # Default to all `dataframe.columns`
        columns = dataframe.columns
    if dtypes:       # Pull in modifications only for dtypes listed in `columns`
        to_records_kwargs['column_dtypes'] = {}
        for column in dtypes.keys():
            if column in columns:
                to_records_kwargs['column_dtypes'].update({column: dtypes.get(column)})
    return dataframe[columns].to_records(**to_records_kwargs)

With either of the above one could do...

X = pandas.DataFrame(dict(age = [40., 50., 60.], sys_blood_pressure = [140., 150., 160.]))

# Example of overwriting dtype for a column
X_tensor = to_tensor(X, dtypes = {'age': 'int32'})

print("Ages -> {0}".format(X_tensor['age']))
print("SBPs -> {0}".format(X_tensor['sys_blood_pressure']))

... which should output...

Ages -> [40 50 60]
SBPs -> [140. 150. 160.]

... and a full dump of X_tensor should look like the following.

array([(40, 140.), (50, 150.), (60, 160.)],
      dtype=[('age', '<i4'), ('sys_blood_pressure', '<f8')])

Some thoughts

While method two will likely be more efficient than the first, method one (with some modifications) may be more useful for merging two or more pandas.DataFrames into one numpy.array.
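For example, here is a minimal sketch of that merging idea (my addition, not from the original answer; it assumes equal row counts and non-overlapping column names):

import numpy
import pandas

left = pandas.DataFrame(dict(age = [40., 50., 60.]))
right = pandas.DataFrame(dict(sys_blood_pressure = [140., 150., 160.]))

# Concatenate names and formats in matching order, then fill field by field
names = list(left.columns) + list(right.columns)
formats = list(left.dtypes) + list(right.dtypes)
merged = numpy.zeros(len(left), dtype = {'names': names, 'formats': formats})
for frame in (left, right):
    for column in frame.columns:
        merged[column] = frame[column].to_numpy()

print(merged)        # [(40., 140.) (50., 150.) (60., 160.)]
print(merged.dtype)  # [('age', '<f8'), ('sys_blood_pressure', '<f8')]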

Additionally (after swinging back through to review), method one as written will likely face-plant with errors if `dtypes` is passed as something that is not a mapping; next time I'm feeling Pythonic I may resolve that with an `else` condition.

S0AndS0
1

Create an example:

import pandas
import numpy

PandasTable = pandas.DataFrame( {
    "AAA": [4, 5, 6, 7],
    "BBB": [10, 20, 30, 40],
    "CCC": [100, 50, -30, -50],
    "DDD": ['asdf1', 'asdf2', 'asdf3', 'asdf4'] } )

Solve the problem noting that we are creating something called a "structured numpy array":

NumpyDtypes             = list( PandasTable.dtypes.items() )
NumpyTable              = PandasTable.to_numpy(copy=True)
NumpyTableRows          = [ tuple(Row) for Row in NumpyTable]
NumpyTableWithHeaders   = numpy.array( NumpyTableRows, dtype=NumpyDtypes )

Rewrite the solution in 1 line of code:

NumpyTableWithHeaders2   = numpy.array( [ tuple(Row) for Row in PandasTable.to_numpy(copy=True)], dtype=list( PandasTable.dtypes.items() ) )

Print out the solution results:

print ('NumpyTableWithHeaders', NumpyTableWithHeaders)
print ('NumpyTableWithHeaders.dtype', NumpyTableWithHeaders.dtype)
print ('NumpyTableWithHeaders2', NumpyTableWithHeaders2)
print ('NumpyTableWithHeaders2.dtype', NumpyTableWithHeaders2.dtype)

Which prints:

NumpyTableWithHeaders [(4, 10, 100, 'asdf1') (5, 20,  50, 'asdf2') (6, 30, -30, 'asdf3')
 (7, 40, -50, 'asdf4')]
NumpyTableWithHeaders.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
NumpyTableWithHeaders2 [(4, 10, 100, 'asdf1') (5, 20,  50, 'asdf2') (6, 30, -30, 'asdf3')
 (7, 40, -50, 'asdf4')]
NumpyTableWithHeaders2.dtype [('AAA', '<i8'), ('BBB', '<i8'), ('CCC', '<i8'), ('DDD', 'O')]
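As a quick follow-up check (my addition, not in the original answer), individual columns can then be pulled out of the structured array by name:

print (NumpyTableWithHeaders['AAA'])   # [4 5 6 7]
print (NumpyTableWithHeaders['DDD'])   # ['asdf1' 'asdf2' 'asdf3' 'asdf4']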

Documentation I had to read

Adding row/column headers to NumPy arrays

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html

How to keep column names when converting from pandas to numpy

https://numpy.org/doc/stable/user/basics.creation.html

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html

https://docs.scipy.org/doc/numpy-1.10.1/user/basics.rec.html

Notes and thoughts: pandas should add a flag to its `to_numpy` function which does this, and recent versions of the NumPy documentation should be updated to cover structured arrays, which behave differently than regular ones.

D A
  • I made this solution out of the desire to avoid matrices, class definitions, and more than a handful of lines of code. I hope this helps. – D A Mar 12 '21 at 21:34
  • I suppose this is also relevant: https://stackoverflow.com/questions/46837472/converting-pandas-dataframe-to-structured-arrays – D A Mar 12 '21 at 21:43
  • Is it very different to to_records? – user48956 Mar 15 '21 at 17:16
  • According to this post: https://stackoverflow.com/questions/27995110/numpy-record-array-or-structured-array-or-recarray record arrays are slower than structured arrays. – D A Mar 19 '21 at 23:22
0

OK, here's where I'm leaning:

class NDArrayWithColumns(np.ndarray):
    def __new__(cls, obj,  columns=None):
        obj = obj.view(cls)
        obj.columns = columns
        return obj

    def __array_finalize__(self, obj):
        if obj is None: return
        self.columns = getattr(obj, 'columns', None)

    @staticmethod
    def from_dataframe(df):
        cols = tuple(df.columns)
        arr = df.as_matrix(cols)  # on newer pandas: df[list(cols)].values
        return NDArrayWithColumns.from_array(arr,cols)

    @staticmethod
    def from_array(array,columns):
        if isinstance(array,NDArrayWithColumns):
            return array
        return NDArrayWithColumns(array,tuple(columns))

    def __str__(self):
        sup = np.ndarray.__str__(self)
        if self.columns:
            header = ", ".join(self.columns)
            header = "# " + header + "\n"
            return header+sup
        return sup

NAN = float("nan")
X = pd.DataFrame(dict(age=[40., NAN, 60.], sys_blood_pressure=[140.,150.,160.]))
arr = NDArrayWithColumns.from_dataframe(X)
print arr
print arr.columns
print arr.dtype

Gives:

# age, sys_blood_pressure
[[  40.  140.]
 [  nan  150.]
 [  60.  160.]]
('age', 'sys_blood_pressure')
float64

and can also be passed to typed cython functions expecting an `ndarray[double_t, ndim=2]`.
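A quick demo of what `__array_finalize__` buys you here (my addition): the column metadata survives slicing, because slices of an ndarray subclass are views that pass through `__array_finalize__`:

arr2 = arr[:2]       # slicing returns an NDArrayWithColumns view
print arr2.columns   # ('age', 'sys_blood_pressure')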

UPDATE: this works pretty well, except for some oddness when passing the type to ufuncs.

user48956