I want to create a bunch of histograms from grouped data in a pandas DataFrame. Here's a link to a similar question. To generate some toy data that is very similar to what I am working with, you can use the following code:
from pandas import DataFrame
import numpy as np
x = ['A']*300 + ['B']*400 + ['C']*300
y = np.random.randn(1000)
df = DataFrame({'Letter':x, 'N':y})
I want to put those histograms (read: the binned data) in a new DataFrame and save it for later processing. Here's the real kicker: my file is 6 GB, with 400k+ groups and just 2 columns.
I've thought about using a simple for loop to do the work:
data = []
for group in df['Letter'].unique():
    data.append(np.histogram(df[df['Letter'] == group]['N'],
                              range=(-2000, 2000), bins=50, density=True)[0])
df2 = DataFrame(data)
Note that the bins, range, and density keywords are all necessary for my purposes, so that the histograms are consistent and normalized across the rows of my new DataFrame df2 (the parameter values come from my real dataset, so they're overkill on the toy data). The for loop works great: on the toy dataset it generates a DataFrame of 3 rows and 50 columns, as expected. On my real dataset, though, I've estimated that the code would take around 9 days to complete. Is there any better/faster way to do what I'm looking for?
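For reference, here is the same computation expressed through groupby instead of repeated boolean masking. This is just a sketch with the same bins/range/density parameters (the binned helper name is my own), and I haven't benchmarked it on the full file:

def binned(series):
    # Same histogram parameters as in the loop above.
    return np.histogram(series, range=(-2000, 2000), bins=50, density=True)[0]

# Build a dict of group name -> 50 bin values, then transpose to one row per group.
df2 = DataFrame({name: binned(grp) for name, grp in df.groupby('Letter')['N']}).T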
P.S. I've thought about multiprocessing, but I suspect the overhead of creating processes and slicing the data would make it slower than just running this serially (I may be wrong and wouldn't mind being corrected on this).
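For concreteness, here is roughly what I had in mind for the multiprocessing route. It is a rough, unbenchmarked sketch; hist_one is a hypothetical helper, and the same histogram parameters are assumed:

from multiprocessing import Pool

def hist_one(args):
    # args is a (group_name, values) pair; same bins/range/density as above.
    name, values = args
    return name, np.histogram(values, range=(-2000, 2000), bins=50, density=True)[0]

if __name__ == '__main__':
    # Materialize (name, values) pairs so each worker only gets its own slice.
    groups = [(name, grp.values) for name, grp in df.groupby('Letter')['N']]
    with Pool() as pool:
        results = dict(pool.map(hist_one, groups))
    df2 = DataFrame(results).T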