i want to calculate a correlation matrix of a large dataset (1M rows). The idea is to calculate the correlation of product sales. If two products have a similar increase/decrease in sales year over year there might be a correlation.
I have already tried the posts here:
- How to get correlation matrix values pyspark
- How to get the correlation matrix of a pyspark data frame?
- How to plot correlation heatmap when using pyspark+databricks
- https://gist.github.com/cameres/bc24ac6711c9e537dd20be47b2a83558
Which all more or less do the same but they collect the correlation matrix back at the driver. Which is a problem since the large dataset makes this collection RAM intense. I am looking for a way to break this problem into pieces and make use of the distributed computing of Spark. There are 170k unique products so the job runs 170k times and there are 29B combinations.
My idea is to calculate the correlation column by column (Cross-Apply) and then collect it in a data frame (or RDD) to run filters over it like (only with correlation > 0.8). But I have no good idea to start.
The Dataset basically looks like this.
d = {'Product': ['A', 'B', 'C','A', 'B', 'C','A', 'B', 'C'],\
'Year': [2010, 2010, 2010, 2011, 2011, 2011, 2012, 2012, 2012],\
'Revenue': [100, 200, 300, 110, 190, 320, 120, 220, 350]}
df = pd.DataFrame(data=d)
I transpose the data to have the years in columns.
df = df.pivot(index='Product', columns='Year', values='Revenue').fillna(0)
I calculate the pct_change to have the relative change year over year.
df_diff = df.pct_change(axis=1).replace([np.inf, -np.inf], np.nan).fillna(0)
Year 2010 2011 2012
Product
A 0.0 0.100000 0.090909
B 0.0 -0.050000 0.157895
C 0.0 0.066667 0.093750
And I would need the correlation... With pandas easy
# change structure
df_diff = df_diff.stack().unstack(level=0)
# get correlation
df_diff = df_diff.corr().abs()
# change structure back
df_diff = df_diff.unstack().to_frame(name='value')
df_diff.index = df_diff.index.set_names(['Product_1', 'Product_2'])
df_diff.reset_index(inplace=True)
Product_1 Product_2 value
0 A A 1.000000
1 A B 0.207317
2 A C 0.933485
3 B A 0.207317
4 B B 1.000000
5 B C 0.544352
6 C A 0.933485
7 C B 0.544352
8 C C 1.000000