How to convert a pandas dataframe into a numpy array with the column names

Question

This must use vectorized methods, nothing iterative

I would like to create a numpy array from pandas dataframe.

My code:

import pandas as pd
_df = pd.DataFrame({'itme': ['book', 'book' , 'car', ' car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
 
item     color    val
book    green   -22.70
book    blue    -109.60
car     red     -57.19
car     green   -11.20
bike    blue    -25.60
bike    red     -33.61

There are about 12k million rows.

I need to create a numpy array like :

item    green    blue     red
book    -22.70  -109.60   null
car     -11.20   null     -57.19
bike    null    -25.60    -33.16

each row is the item name and each col is color name. The order of the items and colors are not important. But, in numpy array, there are no row and column names, I need to keep the item and color name for each value, so that I know what the value represents in the numpy array.

For example

 how to know that -57.19 is for "car" and "red" in numpy array ?

So, I need to create a dictionary to keep the mapping between :

  item <--> row index in the numpy array
  color <--> col index in the numpy array

I do not want to use iteritems and itertuples because they are not efficient for large dataframe due to How to iterate over rows in a DataFrame in Pandas and How to iterate over rows in a DataFrame in Pandas and Python Pandas iterate over rows and access column names and Does pandas iterrows have performance issues?

I prefer numpy vectorization solution for this.

How to efficiently convert the pandas dataframe to numpy array ? The array will also be transformed to torch.tensor.

thanks

Trenton McKinney · Accepted Answer · 2020-11-15T01:12:42.267

do a quick search for a val by their "item" and "color" with one of the following options:
1. Use pandas Boolean indexing
2. Convert the dataframe into a numpy.recarry using pandas.DataFrame.to_records, and also use Boolean indexing
.item is a method for both pandas and numpy, so don't use 'item' as a column name. It has been changed to '_item'.
As an FYI, numpy is a pandas dependency, and much of pandas vectorized functionality directly corresponds to numpy.

import pandas as pd
import numpy as np

# test data
df = pd.DataFrame({'_item': ['book', 'book' , 'car', 'car', 'bike', 'bike'], 'color': ['green', 'blue' , 'red', 'green' , 'blue', 'red'], 'val' : [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})

# Use pandas Boolean index to
selected = df[(df._item == 'book') & (df.color == 'blue')]

# print(selected)
_item color    val
 book  blue -109.6

# Alternatively, create a recarray
v = df.to_records(index=False)

# display(v)
rec.array([('book', 'green',  -22.7 ), ('book', 'blue', -109.6 ),
           ('car', 'red',  -57.19), ('car', 'green',  -11.2 ),
           ('bike', 'blue',  -25.6 ), ('bike', 'red',  -33.61)],
          dtype=[('_item', 'O'), ('color', 'O'), ('val', '<f8')])

# search the recarray
selected = v[(v._item == 'book') & (v.color == 'blue')]

# print(selected)
[('book', 'blue', -109.6)]

Update in response to OP edit

You must first reshape the dataframe using pandas.DataFrame.pivot, and then use the previously mentioned methods.

dfp = df.pivot(index='_item', columns='color', values='val')

# display(dfp)
color   blue  green    red
_item                     
bike   -25.6    NaN -33.61
book  -109.6  -22.7    NaN
car      NaN  -11.2 -57.19

# create a numpy recarray
v = dfp.to_records(index=True)

# display(v)
rec.array([('bike',  -25.6,   nan, -33.61),
           ('book', -109.6, -22.7,    nan),
           ('car',    nan, -11.2, -57.19)],
          dtype=[('_item', 'O'), ('blue', '<f8'), ('green', '<f8'), ('red', '<f8')])

# select data
selected = v.blue[(v._item == 'book')]

# print(selected)
array([-109.6])

How to convert a pandas dataframe into a numpy array with the column names

1 Answers1

Update in response to OP edit

Linked