
I have 40 data sets, each about 115MB in size, and I would like to plot them all overlaid on the same log-log plot.

# make example data 
import numpy as np
data_x = []
data_y = []
for _ in range(40):
    x, y = np.random.random(size=(2, int(7e6)))  # 7e6 chosen to make about 115MB size
    data_x.append(x)
    data_y.append(y)
del x, y

# now show the size of one set in MB
print((data_x[0].nbytes + data_y[0].nbytes)/1e6, 'MB')
# 112.0 MB

My computer has about 30GB of available RAM, so I fully expect the 40 × 112MB ≈ 4.5GB of data to fit.

I would like to make an overlaid log-log plot of every data set:

import matplotlib.pyplot as plt 
for x, y in zip(data_x, data_y):
    plt.loglog(x, y)
plt.show()

But the memory overhead is too large. I'd prefer not to downsample the data. Is there a way to reduce the memory overhead in order to plot this 4.5GB of data?

I would prefer to keep the for loop, as I need to modify the point style and color of each data set inside it, so concatenating the data sets into a single call is not an option.
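For concreteness, a hypothetical version of that loop (the marker and colormap choices here are placeholders, not my real styling):

import matplotlib.pyplot as plt
from matplotlib import cm

for i, (x, y) in enumerate(zip(data_x, data_y)):
    # placeholder styling: tiny unconnected markers, one colormap hue per set
    plt.loglog(x, y, marker='.', linestyle='', markersize=1,
               color=cm.viridis(i / len(data_x)))
plt.show()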

The most similar question I could find is here, but it differs in that the loop there is used to create distinct plots rather than to add to the same plot, so adding a plt.clf() call inside the loop does not help me.

kevinkayaks
  • This sounds like the definition of overplotting. Maybe you should bin your data? There is no way that displaying that many points yields any value – user8408080 Mar 11 '19 at 23:39
  • Yeah, I could bin, but I'd have to use an exponentially growing bin size, since my data spans multiple orders of magnitude. Definitely not a quick and dirty solution. It'd be much simpler if there were a clean matplotlib capability to control memory overhead, such as sequentially plotting onto the png output of the previous call (a sketch of that idea follows this thread). I'm just asking the community if an option exists. – kevinkayaks Mar 11 '19 at 23:42
  • What about matplotlib makes writing 4.5GB of data into a 500x500 pixel image cost more than 4.5GB in overhead? I'm just thinking I'm missing something... – kevinkayaks Mar 11 '19 at 23:47
  • 2
    You can expect the matplotlib figure object and its children to become much larger than the raw bytesize of the data. I don't think you're missing something. Your data is just too large to be plotted with matplotlib. I would definitely consider some sort of binning. – ImportanceOfBeingErnest Mar 12 '19 at 00:12
  • Thanks guys. I'll pursue the binning. Please see https://stackoverflow.com/questions/55112430/bin-one-column-and-sum-the-other-of-2-n-array if you have time. Thx – kevinkayaks Mar 12 '19 at 00:43
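As an aside, here is a minimal sketch of the "plot onto the png output of the previous call" idea from the comments, assuming Pillow is available and that the axis limits are known up front so every frame lines up (the limits below are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

xlim, ylim = (1e-3, 1), (1e-3, 1)  # placeholder limits; must match across frames

canvas = None
for x, y in zip(data_x, data_y):
    fig, ax = plt.subplots(figsize=(5, 5), dpi=100)
    ax.loglog(x, y, ',')            # pixel markers keep rendering cheap
    ax.set_xlim(*xlim)
    ax.set_ylim(*ylim)
    fig.savefig('frame.png', transparent=True)
    plt.close(fig)                  # free the figure before the next data set
    frame = Image.open('frame.png').convert('RGBA')
    canvas = frame if canvas is None else Image.alpha_composite(canvas, frame)
canvas.save('overlay.png')

With this, matplotlib only ever holds one 112MB data set at a time, at the cost of re-rendering the axes for each frame.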

1 Answer


Here is my attempt at solving your problem:

# make example data spanning several orders of magnitude
import numpy as np
import matplotlib.pyplot as plt
import colorsys

data_x = np.random.random((40, int(7e6))) * np.logspace(0, 7, 40)[:, None]
data_y = np.random.random((40, int(7e6))) * np.logspace(0, 7, 40)[:, None]

# now show the size of one set in MB
print((data_x[0].nbytes + data_y[0].nbytes)/1e6, 'MB')
# 112.0 MB

# bin in log space so the bins grow with the orders of magnitude
x, y = np.log(data_x), np.log(data_y)

hists = [np.histogram2d(x_, y_, bins=1000) for x_, y_ in zip(x, y)]

N = len(hists)

for i, h in enumerate(hists):
    color = colorsys.hsv_to_rgb(i/N, 1, 1)  # a distinct hue per data set
    rows, cols = np.where(h[0] > 0)         # occupied bins only
    plt.scatter(h[1][rows], h[2][cols], color=color, s=1)

plt.show()

Result:

[figure: occupied histogram bins scattered in log space, one hue per data set]

I take the log of both the x and y data and then bin it, so the bins grow exponentially in the original units. Since I don't think you are interested in densities, I just plot a single static color wherever a bin contains at least one element.
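For a sense of scale: with bins=1000, each histogram's float64 count array is 1000 × 1000 × 8 bytes = 8 MB, so all 40 histograms together occupy about 320 MB, and each scatter call draws at most 10^6 points rather than 7 × 10^6.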

user8408080
  • Thanks @user8408080 – this does work! But I need more control over the bin size. How can I generate the bins more carefully? I'd like one bin between `0` and `10`, then ten bins between `10^k` and `10^{k+1}` for all k>0 – kevinkayaks Mar 12 '19 at 01:00
  • As stated in the [docs](https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram2d.html), you can set your own edges for the bins – user8408080 Mar 12 '19 at 01:10
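For reference, a minimal sketch of the edges described in the comment above (one bin from 0 to 10, then ten bins per decade), applied to the raw data rather than its log; the decade count of 7 is an assumption matching the example data:

import numpy as np

decades = 7  # example data runs up to ~1e7; adjust to the real range
# one bin [0, 10], then ten log-spaced bins per decade from 10^1 to 10^decades
edges = np.concatenate(([0.0], np.logspace(1, decades, 10 * (decades - 1) + 1)))

# pass the same edges for both axes on the *un-logged* data:
# H, xe, ye = np.histogram2d(data_x[0], data_y[0], bins=[edges, edges])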