I want to make a 2D histogram of a large data set. If I load all the data at once I get a MemoryError, so I have split the data into smaller chunks that can be loaded separately. The problem is that I can make a 2D histogram for each chunk but not for the entire dataset, since I can't load it all at once.
Let's look at a toy example. Imagine that x and y are the data from one of my chunks. In this example the "chunk" consists of a million data points:
import numpy as np
import matplotlib.pyplot as plt
x=np.random.normal(-1,1,1000000)
y=np.random.normal(-1,1,1000000)
#CLASSIC WAY TO MAKE A 2D HISTOGRAM
plt.hist2d(x,y,bins=100)
plt.savefig("histo1.png",dpi=800)
plt.close()
#ALTERNATIVE WAY TO MAKE A 2D HISTOGRAM
arr, xedges, yedges=np.histogram2d(x,y,bins=100)
frame=[xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.imshow(np.rot90(arr), interpolation="none", extent=frame)
plt.savefig("histo2.png",dpi=800)
plt.close()
Here I show two different ways of making exactly the same image. The classic way uses matplotlib.pyplot.hist2d() to plot the histogram directly from the data. The alternative way first builds the density matrix from the data (with numpy.histogram2d()) and then plots it with matplotlib.pyplot.imshow().
All of this works fine, and both methods yield the exact same image. But now I want to make the histogram for the entire dataset. I can't do it the classic way, since loading x=x_data_chunk1+x_data_chunk2+... and y=y_data_chunk1+y_data_chunk2+... is impossible: it would exceed the available RAM. But in theory the alternative method should work. I can load the first chunk and extract x and y, build a density matrix from them with numpy.histogram2d(), store it, delete x and y, and do the same with the next chunk. In the end I will have a density matrix for each chunk without ever having two chunks loaded at the same time. The density matrices are not a problem, since they are only grids that count the number of data points in each cell (they obviously occupy far less memory than the actual data). Finally, I add all the density matrices into one and plot the result with matplotlib.pyplot.imshow().
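The chunk-by-chunk procedure described above can be sketched as follows. This is only a minimal sketch, and it rests on one assumption that is not stated above: every chunk is binned on the same bin edges, chosen in advance (here `np.linspace(-6, 4, 101)`, an illustrative range), so that the per-chunk matrices line up cell by cell. `accumulate_hist2d` and `toy_chunks` are hypothetical names, and `toy_chunks` merely stands in for loading the real chunks from disk.

```python
import numpy as np

def accumulate_hist2d(chunks, xedges, yedges):
    """Sum per-chunk 2D histograms that are all computed on one shared grid."""
    total = np.zeros((len(xedges) - 1, len(yedges) - 1))
    for x, y in chunks:
        # Passing explicit edges keeps every chunk on the same cells.
        counts, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])
        total += counts
    return total

def toy_chunks(n_chunks=3, size=100_000, seed=0):
    """Hypothetical stand-in for loading one chunk at a time."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        yield rng.normal(-1, 1, size), rng.normal(-1, 1, size)

# Assumed fixed grid: 100 bins per axis over an illustrative range.
xedges = np.linspace(-6, 4, 101)
yedges = np.linspace(-6, 4, 101)

total = accumulate_hist2d(toy_chunks(), xedges, yedges)
print(total.shape, total.sum())
```

Only one chunk is held in memory at a time because `toy_chunks` is a generator. Points that fall outside the chosen edges are not counted by `np.histogram2d`, so the range has to cover the data. The resulting `total` can be passed to `plt.imshow(np.rot90(total), interpolation="none", extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])` as in the first example.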
In theory this should work. So I first tried a toy example with two data chunks, (x1,y1) and (x2,y2), to check that the output matches what the classic way would give if all the data could be loaded at once. I just want to see that both ways still produce the same image, even though in the real situation the classic way will be impossible because of the MemoryError.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
#CHUNK OF DATA 1
x1=np.random.normal(-1,1,1000000)
y1=np.random.normal(-1,1,1000000)
#CHUNK OF DATA 2
x2=np.random.normal(-1,1,1000000)
y2=np.random.normal(-1,1,1000000)
#CLASSIC WAY
x=x1+x2
y=y1+y2
plt.hist2d(x,y,bins=100)
plt.savefig("histo1.png",dpi=800)
plt.close()
#ALTERNATIVE WAY
arr1, xedges1, yedges1=np.histogram2d(x1,y1,bins=100)
arr2, xedges2, yedges2=np.histogram2d(x2,y2,bins=100)
arr=arr1+arr2
frame=[min([xedges1[0],xedges2[0]]), max([xedges1[-1],xedges2[-1]]), min([yedges1[0],yedges2[0]]), max([yedges1[-1],yedges2[-1]])]
plt.imshow(np.rot90(arr), interpolation="none", extent=frame)
plt.savefig("histo2.png",dpi=800)
plt.close()
Doing this I get two completely different images. Why is this happening? The title asks a more general question, which I tried to solve here, but I would also really like to know how to solve this particular situation.
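Two things seem worth checking in the toy example above, sketched here on smaller arrays: whether `x1+x2` really joins the two chunks, and whether the two `np.histogram2d` calls with `bins=100` end up using the same bin edges. The seed and the sample size of 1000 are illustrative choices, not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
x1, y1 = rng.normal(-1, 1, 1000), rng.normal(-1, 1, 1000)
x2, y2 = rng.normal(-1, 1, 1000), rng.normal(-1, 1, 1000)

# For NumPy arrays, + adds element by element; it does not append.
summed = x1 + x2
joined = np.concatenate([x1, x2])
print(summed.shape, joined.shape)

# With bins=100, each call derives its edges from its own data range.
_, xe1, _ = np.histogram2d(x1, y1, bins=100)
_, xe2, _ = np.histogram2d(x2, y2, bins=100)
same_edges = np.array_equal(xe1, xe2)
print(same_edges)
```

If the shapes differ, the "classic way" in the toy example is not plotting the combined dataset. If the edges differ, cell (i, j) of arr1 and cell (i, j) of arr2 do not cover the same region, so arr1+arr2 mixes counts from different regions.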
The real question here is: how can I make a 2D histogram by adding up the density matrices of different data chunks and get the same result as if I had used matplotlib.pyplot.hist2d() on the entire dataset?
There wasn't enough space in the title for this question and I can't imagine a more condensed version of it, so in the end I decided to ask a more general question and explain my particular case. Sorry if this is inconvenient.