How can I make 2D histograms based on large data with matplotlib?

Question

I want to make a 2D histogram of a large data set. If I load all the data at once I get a MemoryError. So I have split the data into smaller chunks that can be loaded separately. The problem is that I can make a 2D histogram for each chunk but not for the entire dataset, since I can't load it all at once.

Let's see a toy example. Imagine that x and y are the data coming from one of my chunks. In this example the "chunk" consists of a million data points:

import numpy as np
import matplotlib.pyplot as plt

x=np.random.normal(-1,1,1000000)
y=np.random.normal(-1,1,1000000)

#CLASSIC WAY TO MAKE A 2D HISTOGRAM
plt.hist2d(x,y,bins=100)
plt.savefig("histo1.png",dpi=800)
plt.close()

#ALTERNATIVE WAY TO MAKE A 2D HISTOGRAM
arr, xedges, yedges=np.histogram2d(x,y,bins=100)
frame=[xedges[0], xedges[-1], yedges[0], yedges[-1]]
plt.imshow(np.rot90(arr), interpolation="none", extent=frame)
plt.savefig("histo2.png",dpi=800)
plt.close()

So here I show two different ways of making exactly the same image: The classic way involves using matplotlib.pyplot.hist2d() to directly plot the histogram from the data and the alternative way involves creating the density matrix from the data first (with numpy.histogram2d()) and then using matplotlib.pyplot.imshow() to plot it.

All of this works fine, and both methods yield the exact same image. But now I want to make the histogram for the entire dataset. I can't do it the classic way, since loading x=x_data_chunk1+x_data_chunk2+... and y=y_data_chunk1+y_data_chunk2+... is impossible: it exhausts the RAM. But in theory the alternative method should work. I can load the first chunk and extract x and y, build a density matrix from them with numpy.histogram2d(), store it, delete x and y, and do the same with the next chunk. In the end we have a density matrix for each chunk without ever loading two chunks at the same time. The density matrices are not a problem, since they are only grids that count the number of data points in each cell (they take up far less memory than the actual data). Finally we add all the density matrices into one and plot the result with matplotlib.pyplot.imshow().

In theory this should work. So I first tried a toy example with two data chunks, (x1,y1) and (x2,y2), to see whether the output matches what the classic way would give if all the data could be loaded at once. This is only a check that both ways produce the same image, as before; in the real situation the classic way is impossible because of the MemoryError.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl

#CHUNK OF DATA 1
x1=np.random.normal(-1,1,1000000)
y1=np.random.normal(-1,1,1000000)

#CHUNK OF DATA 2
x2=np.random.normal(-1,1,1000000)
y2=np.random.normal(-1,1,1000000)

#CLASSIC WAY
x=x1+x2
y=y1+y2
plt.hist2d(x,y,bins=100)
plt.savefig("histo1.png",dpi=800)
plt.close()

#ALTERNATIVE WAY
arr1, xedges1, yedges1=np.histogram2d(x1,y1,bins=100)
arr2, xedges2, yedges2=np.histogram2d(x2,y2,bins=100)
arr=arr1+arr2
frame=[min([xedges1[0],xedges2[0]]), max([xedges1[-1],xedges2[-1]]), min([yedges1[0],yedges2[0]]), max([yedges1[-1],yedges2[-1]])]
plt.imshow(np.rot90(arr), interpolation="none", extent=frame)
plt.savefig("histo2.png",dpi=800)
plt.close()

Doing this I get two completely different images. Why is this happening? The title asks a more general question that I tried to solve here, but I would also really like to know how to solve this particular situation.

The true question here is: how can I make 2D histograms by adding up density matrices for different data chunks and get the same result as if I were using matplotlib.pyplot.hist2d() on the entire dataset?

There wasn't enough space in the title for this question and I can't imagine a more condensed version of it, so in the end I decided to ask a more general question and explain my particular case. Sorry if this is inconvenient.

Why do you want to add `x1` and `x2` - don't you want to concatenate them? i.e. `x = np.concatenate([x1,x2])`. Same for `y`. If this solves the issue let me know and I can formulate an answer. — KolaB, Mar 02 '18 at 18:46

score 2 · Answer 1 · edited Jun 20 '20 at 09:12

I guess there are two problems.

First is a simple error in the code. x=x1+x2 adds the two arrays element by element, whereas what you really want is to append one to the other. Since this only happens in the toy example, it should not matter in the real case, but it is one reason why you see different images here.
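To make the difference concrete, here is a minimal sketch (with smaller arrays than in the question, and a seeded generator chosen only for reproducibility) comparing the element-wise sum with concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary, only for reproducibility
x1 = rng.normal(-1, 1, 1000)
x2 = rng.normal(-1, 1, 1000)

summed = x1 + x2                   # element-wise sum: still 1000 values
joined = np.concatenate((x1, x2))  # one array appended to the other: 2000 values

print(summed.shape, joined.shape)  # (1000,) (2000,)
```

The summed array is a different data set altogether (each value is the sum of two draws), so its histogram cannot match the histogram of the combined chunks.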

Second: Of course you need to use the same bins for all histograms. If you know the bins to use beforehand, e.g. because the minimum and maximum of the data are known, this is easy.

import numpy as np
import matplotlib.pyplot as plt


#CHUNK OF DATA 1
x1=np.random.normal(-1,1,1000000)
y1=np.random.normal(-1,1,1000000)

#CHUNK OF DATA 2
x2=np.random.normal(-1,1,1000000)
y2=np.random.normal(-1,1,1000000)

#CLASSIC WAY
x=np.concatenate((x1,x2))
y=np.concatenate((y1,y2))
plt.hist2d(x,y,bins=np.linspace(-3,3,101))
plt.savefig("histo1.png",dpi=200)

#ALTERNATIVE WAY
arr1, xedges1, yedges1=np.histogram2d(x1,y1,bins=np.linspace(-3,3,101))
arr2, xedges2, yedges2=np.histogram2d(x2,y2,bins=np.linspace(-3,3,101))
arr=arr1+arr2

plt.figure()
plt.pcolormesh(xedges1,yedges1,arr.T)
plt.savefig("histo2.png",dpi=200)
plt.show()

Both versions produce the same figure.
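If you prefer a numerical check over comparing images, the following sketch (again with smaller, seeded arrays that are only an assumption for illustration) compares the count matrices directly:

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is arbitrary, only for reproducibility
x1, y1 = rng.normal(-1, 1, 100_000), rng.normal(-1, 1, 100_000)
x2, y2 = rng.normal(-1, 1, 100_000), rng.normal(-1, 1, 100_000)

edges = np.linspace(-3, 3, 101)  # same bin edges for every histogram

# Histogram of all the data at once
full, _, _ = np.histogram2d(np.concatenate((x1, x2)),
                            np.concatenate((y1, y2)), bins=edges)

# Sum of the per-chunk histograms
chunked = (np.histogram2d(x1, y1, bins=edges)[0]
           + np.histogram2d(x2, y2, bins=edges)[0])

print(np.array_equal(full, chunked))  # True
```

Points outside the chosen edges are dropped in both cases, so the two matrices still agree.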

In case you do not know the bins in advance or cannot rely on just setting them to a useful number, I fear you need to follow a two-step procedure:

1. Read in the data chunk by chunk, find the maximum and minimum of each chunk, update the overall maximum and minimum, and store them for later use. Remove each chunk from memory afterwards.
2. Read in the data again chunk by chunk and do the histogramming as above, using the previously stored minimum and maximum to define the bins in each case.

A starting point may be this question: Numpy histogram of large arrays
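The two-step procedure described above could look roughly like the sketch below. The generator load_chunks() is a hypothetical stand-in for however you read your chunks from disk; it is not part of NumPy or matplotlib, and here it just produces the same random data on every call. The number of chunks and bins are likewise assumptions.

```python
import numpy as np

def load_chunks():
    """Hypothetical loader: yields one (x, y) chunk at a time."""
    rng = np.random.default_rng(0)  # re-seeded so both passes see the same data
    for _ in range(5):
        yield rng.normal(-1, 1, 100_000), rng.normal(-1, 1, 100_000)

# Pass 1: find the overall minimum and maximum without keeping any chunk
xmin = ymin = np.inf
xmax = ymax = -np.inf
for x, y in load_chunks():
    xmin, xmax = min(xmin, x.min()), max(xmax, x.max())
    ymin, ymax = min(ymin, y.min()), max(ymax, y.max())

# Pass 2: histogram every chunk on the same edges and accumulate the counts
nbins = 100
xedges = np.linspace(xmin, xmax, nbins + 1)
yedges = np.linspace(ymin, ymax, nbins + 1)
total = np.zeros((nbins, nbins))
for x, y in load_chunks():
    counts, _, _ = np.histogram2d(x, y, bins=[xedges, yedges])
    total += counts

print(total.sum())  # number of points that fell inside the bins
```

Only one chunk and one nbins-by-nbins array are held in memory at a time. The accumulated matrix can then be drawn as in the code above, for example with plt.pcolormesh(xedges, yedges, total.T).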
