
I have 40 data sets, each about 115MB in size, and I would like to plot them all overlaid on the same log-log plot.

# make example data 
import numpy as np
data_x = []
data_y = []
for _ in range(40):
    x, y = np.random.random(size=(2, int(7e6)))  # 7e6 chosen to make about 115MB size
    data_x.append(x)
    data_y.append(y)
del x, y

# now show the size of one set in MB
print((data_x[0].nbytes + data_y[0].nbytes)/1e6, 'MB')
# 112.0 MB

My computer has about 30GB of available RAM, so I fully expect the 40 × 112MB ≈ 4.5GB of data to fit.

I would like to make an overlaid log-log plot of every data set:

import matplotlib.pyplot as plt 
for x, y in zip(data_x, data_y):
    plt.loglog(x, y)
plt.show()

But the memory overhead is too large. I'd prefer not to downsample the data. Is there a way to reduce the memory overhead in order to plot this 4.5GB of data?

I would prefer to keep the for loop, as I need to modify the point style and color of each data set inside it, so concatenating the data sets into a single call is not an option.
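For concreteness, a hypothetical version of that loop (the marker and colormap choices here are placeholders, not my real styling):

import matplotlib.pyplot as plt
from matplotlib import cm

for i, (x, y) in enumerate(zip(data_x, data_y)):
    # placeholder styling: tiny unconnected markers, one colormap hue per set
    plt.loglog(x, y, marker='.', linestyle='', markersize=1,
               color=cm.viridis(i / len(data_x)))
plt.show()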

The most similar question I could find is here, but it differs in that the loop there is used to create distinct plots rather than to add to the same plot, so adding a plt.clf() call inside the loop does not help me.

kevinkayaks
  • This sounds like the definition of overplotting. Maybe you should bin your data? There is no way that displaying that many points yields any value – user8408080 Mar 11 '19 at 23:39
  • Yeah, I could bin, but I'd have to use an exponentially growing bin size, since my data spans multiple orders of magnitude. Definitely not a quick and dirty solution. It'd be much simpler if there were a clean matplotlib capability to control memory overhead, such as sequentially plotting onto the png output of the previous call (a sketch of that idea follows this thread). I'm just asking the community if an option exists. – kevinkayaks Mar 11 '19 at 23:42
  • What about matplotlib makes writing 4.5GB of data into a 500x500 pixel image cost more than 4.5GB in overhead? I'm just thinking I'm missing something... – kevinkayaks Mar 11 '19 at 23:47
  • 2
    You can expect the matplotlib figure object and its children to become much larger than the raw bytesize of the data. I don't think you're missing something. Your data is just too large to be plotted with matplotlib. I would definitely consider some sort of binning. – ImportanceOfBeingErnest Mar 12 '19 at 00:12
  • Thanks guys. I'll pursue the binning. Please see https://stackoverflow.com/questions/55112430/bin-one-column-and-sum-the-other-of-2-n-array if you have time. Thx – kevinkayaks Mar 12 '19 at 00:43
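As an aside, here is a minimal sketch of the "plot onto the png output of the previous call" idea from the comments, assuming Pillow is available and that the axis limits are known up front so every frame lines up (the limits below are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

xlim, ylim = (1e-3, 1), (1e-3, 1)  # placeholder limits; must match across frames

canvas = None
for x, y in zip(data_x, data_y):
    fig, ax = plt.subplots(figsize=(5, 5), dpi=100)
    ax.loglog(x, y, ',')            # pixel markers keep rendering cheap
    ax.set_xlim(*xlim)
    ax.set_ylim(*ylim)
    fig.savefig('frame.png', transparent=True)
    plt.close(fig)                  # free the figure before the next data set
    frame = Image.open('frame.png').convert('RGBA')
    canvas = frame if canvas is None else Image.alpha_composite(canvas, frame)
canvas.save('overlay.png')

With this, matplotlib only ever holds one 112MB data set at a time, at the cost of re-rendering the axes for each frame.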

1 Answer


Here is my attempt at solving your problem:

# make example data spanning several orders of magnitude
import numpy as np
import matplotlib.pyplot as plt
import colorsys

data_x = np.random.random((40, int(7e6))) * np.logspace(0, 7, 40)[:, None]
data_y = np.random.random((40, int(7e6))) * np.logspace(0, 7, 40)[:, None]

# now show the size of one set in MB
print((data_x[0].nbytes + data_y[0].nbytes)/1e6, 'MB')
# 112.0 MB

# bin in log space so the bins grow with the orders of magnitude
x, y = np.log(data_x), np.log(data_y)

hists = [np.histogram2d(x_, y_, bins=1000) for x_, y_ in zip(x, y)]

N = len(hists)

for i, h in enumerate(hists):
    color = colorsys.hsv_to_rgb(i/N, 1, 1)  # a distinct hue per data set
    rows, cols = np.where(h[0] > 0)         # occupied bins only
    plt.scatter(h[1][rows], h[2][cols], color=color, s=1)

plt.show()

Result:

[figure: occupied histogram bins scattered in log space, one hue per data set]

I take the log of both the x and y data and then bin it, so the bins grow exponentially in the original units. Since I don't think you are interested in densities, I just plot a single static color wherever a bin contains at least one element.
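For a sense of scale: with bins=1000, each histogram's float64 count array is 1000 × 1000 × 8 bytes = 8 MB, so all 40 histograms together occupy about 320 MB, and each scatter call draws at most 10^6 points rather than 7 × 10^6.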

user8408080
  • Thanks @user8408080 – this does work! But I need more control over the bin size. How can I generate the bins more carefully? I'd like one bin between `0` and `10`, then ten bins between `10^k` and `10^{k+1}` for all k>0 – kevinkayaks Mar 12 '19 at 01:00
  • As stated in the [docs](https://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram2d.html), you can set your own edges for the bins – user8408080 Mar 12 '19 at 01:10
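For reference, a minimal sketch of the edges described in the comment above (one bin from 0 to 10, then ten bins per decade), applied to the raw data rather than its log; the decade count of 7 is an assumption matching the example data:

import numpy as np

decades = 7  # example data runs up to ~1e7; adjust to the real range
# one bin [0, 10], then ten log-spaced bins per decade from 10^1 to 10^decades
edges = np.concatenate(([0.0], np.logspace(1, decades, 10 * (decades - 1) + 1)))

# pass the same edges for both axes on the *un-logged* data:
# H, xe, ye = np.histogram2d(data_x[0], data_y[0], bins=[edges, edges])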