- This must use vectorized methods, nothing iterative
I would like to create a numpy array from a pandas DataFrame.
My code:
import pandas as pd
_df = pd.DataFrame({'item': ['book', 'book', 'car', 'car', 'bike', 'bike'],
                    'color': ['green', 'blue', 'red', 'green', 'blue', 'red'],
                    'val': [-22.7, -109.6, -57.19, -11.2, -25.6, -33.61]})
item color val
book green -22.70
book blue -109.60
car red -57.19
car green -11.20
bike blue -25.60
bike red -33.61
The real dataframe has about 12 million rows.
I need to create a numpy array like:
item   green     blue      red
book   -22.70  -109.60     null
car    -11.20     null   -57.19
bike     null   -25.60   -33.61
Each row corresponds to an item name and each column to a color name; the order of the items and colors is not important. But a numpy array has no row or column names, so I need to keep the item and color name for each value in order to know what each value in the array represents.
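If it helps, the wide layout above is what I would get from a pivot. A minimal sketch of what I mean (I assume pivot counts as vectorized here; `wide` is just a name I made up):

wide = _df.pivot(index='item', columns='color', values='val')
# rows are unique items, columns are unique colors,
# and missing (item, color) pairs become NaN
print(wide)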
For example, how do I know that -57.19 belongs to "car" and "red" in the numpy array?
So, I need to create dictionaries to keep the mappings (see the sketch after this list):
item <--> row index in the numpy array
color <--> col index in the numpy array
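To make the mappings concrete, this is the kind of lookup I mean, building on `wide` from the sketch above (`row_of` and `col_of` are just illustrative names):

arr = wide.to_numpy()          # plain 2-D float array; the labels are gone here
row_of = {item: i for i, item in enumerate(wide.index)}      # item  -> row index
col_of = {color: j for j, color in enumerate(wide.columns)}  # color -> col index
# this only loops over the few unique labels, not over the 12 million rows

arr[row_of['car'], col_of['red']]   # -57.19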
I do not want to use iteritems or itertuples because they are not efficient for a large dataframe, as discussed in "How to iterate over rows in a DataFrame in Pandas", "Python Pandas iterate over rows and access column names", and "Does pandas iterrows have performance issues?".
I would prefer a numpy-vectorized solution for this.
How can I efficiently convert the pandas DataFrame to a numpy array? The array will also be converted to a torch tensor afterwards.
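For context, that later conversion would be roughly the following (torch.from_numpy shares memory with the numpy array, and the NaNs stay NaN):

import torch
t = torch.from_numpy(arr)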
thanks