Big dataset contour plot using pyplot and pandas

Question

I have a massive data sample that I need to visualize. Using pandas, I can create a dataframe with the relevant variables: 3 arrays of length 20 million.

These are the x, y geometrical coordinates and the z value at each (x, y) point.

I need a "heatmap" of z at each (x, y) point, but no pyplot function I have tried works with numbers this big.

What is the best way to go about it?

That level of detail is simply not going to be visible unless you have an enormous output file and are willing to zoom around it. Can you aggregate your data in any way? — asongtoruin, Sep 20 '18 at 11:12
I am happy to "bin" it in ranges of x, y and average it, or something similar. I was considering whether I could use a 2D histogram somehow. But I don't need the "density of x, y", I need "z for every x, y". — Gedas Sarpis, Sep 20 '18 at 12:03
If you have any duplicate data values, removing them will help reduce the data size. — James Phillips, Sep 20 '18 at 14:59
Maybe you are looking for [`scipy.stats.binned_statistic_2d`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic_2d.html)? This could hence be a duplicate of [this question](https://stackoverflow.com/questions/6163334/binning-data-in-python-with-scipy-numpy). — ImportanceOfBeingErnest, Sep 20 '18 at 23:14

score 1 · Answer 1 · answered Sep 20 '18 at 15:11

Dummy data

Tested with 200,000 rows

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

# 200,000 random (X, Y) points in [0, 1), with Z = X + 2*Y
df = pd.DataFrame(np.random.rand(200000, 2), columns=['X', 'Y'])
df['Z'] = df.X + df.Y * 2  # vectorized column arithmetic

Code

Create bin intervals for X and Y, group the dataframe by them, and take the mean of the Z column. This gives a mean Z for every (X, Y) bin pair. Finally, draw a scatter plot of the bin midpoints, colored by mean Z.

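# 1000 equal-width bins covering [0, 1]; linspace includes the upper edge 1.0
edges = np.linspace(0, 1, 1001)
binsX = pd.cut(df.X, edges)
binsY = pd.cut(df.Y, edges)
# observed=True keeps only the bin pairs that actually contain data
binned = df.groupby([binsX, binsY], observed=True)['Z'].mean().reset_index()
# Replace each interval with its numeric midpoint so it can be plotted
binned['X'] = binned['X'].map(lambda iv: iv.mid).astype(float)
binned['Y'] = binned['Y'].map(lambda iv: iv.mid).astype(float)
plt.scatter(binned.X, binned.Y, c=binned.Z, s=0.01)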
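
The 2D-histogram idea raised in the comments can also produce a mean of Z per bin rather than a point density. The sketch below is one possible way to do it, not part of the tested code above: a histogram weighted by Z gives the sum of Z in each bin, and dividing by the unweighted counts gives the mean. The function name `binned_mean` and the choice of 200 bins are assumptions made for illustration.

```python
import numpy as np
from matplotlib import pyplot as plt


def binned_mean(x, y, z, bins=1000):
    """Return the mean of z in each (x, y) bin; empty bins become NaN."""
    # Weighted histogram: sum of z in each bin.
    sums, x_edges, y_edges = np.histogram2d(x, y, bins=bins, weights=z)
    # Unweighted histogram on the same edges: number of points in each bin.
    counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
    # 0 / 0 in empty bins yields NaN; silence the resulting warning.
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = sums / counts
    return mean, x_edges, y_edges


# Dummy data in the same spirit as above (hypothetical, for illustration).
rng = np.random.default_rng(0)
x = rng.random(200_000)
y = rng.random(200_000)
z = x + 2 * y

mean_z, x_edges, y_edges = binned_mean(x, y, z, bins=200)

# histogram2d puts x along the first axis, so transpose for pcolormesh.
fig, ax = plt.subplots()
mesh = ax.pcolormesh(x_edges, y_edges, mean_z.T)
fig.colorbar(mesh, ax=ax, label="mean Z")
ax.set_xlabel("X")
ax.set_ylabel("Y")
```

Call `plt.show()` or `fig.savefig(...)` afterwards to see the result. Because only two fixed-size arrays of bin sums and counts are kept, this should also suit data processed in chunks, provided the same bin edges are used for every chunk.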
