
I am trying to plot a very big file (~5 GB) using python and matplotlib. I am able to load the whole file in memory (the total available in the machine is 16 GB), but when I plot it using a simple imshow I get a segmentation fault. This is most probably due to the ulimit, which I have set to 15000 but cannot set higher. I have come to the conclusion that I need to plot my array in batches and therefore made a simple code to do that. My main issue is that when I plot a batch of the big array the x coordinates always start from 0 and there is no way I can overlay the images to create the final big one. If you have any suggestions please let me know. Also, I am not able to install new packages like "Image" on this machine due to administrative rights. Here is a sample of the code that reads the first 12 lines of my array and makes 3 plots.

import os
import sys
import numpy as np
import pylab as pl
import matplotlib.cm as cm
import matplotlib.pylab as plt
from optparse import OptionParser
from pylab import *  # provides axis() and xticks() used below

def readalllines(file1, rows, freqs):
    sizer = int(rows * freqs)
    q = np.zeros(sizer, 'float')
    f = open(file1, 'r')
    for i in range(sizer):
        s = f.readline().split()
        q[i] = float(s[4])
        if i % 262144 == 0:
            print '\r ', int(i * 100.0 / sizer), ' percent complete',
    f.close()
    return q

parser = OptionParser()
parser.add_option('-f',dest="filename",help="Read dynamic spectrum from FILE",metavar="FILE")
parser.add_option('-t',dest="dtime",help="The time integration used in seconds, default 10",default=10)
parser.add_option('-n',dest="dfreq",help="The bandwidth of each frequency channel in Hz",default=11.92092896)
parser.add_option('-w',dest="reduce",help="The chunk divider in frequency channels, integer, default 16",default=16)
(opts,args) = parser.parse_args()
rows=12
freqs = 262144

file1 = opts.filename

s = readalllines(file1,rows,freqs)
s = np.reshape(s,(rows,freqs))
s = s.T
print s.shape
#raw_input()

#s_shift = scipy.fftpack.fftshift(s)


#fig = plt.figure()

#fig.patch.set_alpha(0.0)
#axes = plt.axes()
#axes.patch.set_alpha(0.0)
###plt.ylim(0,8)

plt.ion()

i = 0
for o in range(0,rows,4):

    fig = plt.figure()
    #plt.clf()

    plt.imshow(s[:,o:o+4],interpolation='nearest',aspect='auto', cmap=cm.gray_r, origin='lower')
    if o == 0:
        axis([0,rows,0,freqs])
    fdf, fdff = xticks()
    print fdf
    xticks(fdf+o)
    print xticks()
    #axis([o,o+4,0,freqs])
    plt.draw()

    #w, h = fig.canvas.get_width_height()
    #buf = np.fromstring(fig.canvas.tostring_argb(), dtype=np.uint8)
    #buf.shape = (w,h,4)

    #buf = np.roll(buf, 3, axis=2)
    #w,h,_ = buf.shape
    #img = Image.fromstring("RGBA", (w,h),buf.tostring())

    #if prev:
    #    prev.paste(img)
    #    del prev
    #prev = img
    i += 1
pl.colorbar()
pl.show()
thenoone
  • Why are you reading the whole file into memory rather than *just* stripping out the information you need? Is every bit of the file needed by your program? Or rather, why **were** you doing that? – Chris Pfohl Nov 01 '12 at 18:42
  • the whole file is an image I need to plot. Each value in it is a pixel value, so I need the whole file to be plotted. Since it could be loaded in memory, I thought it would make things faster than reading lines of it every time. Bottom line: I need every bit of the file's info on the plot in the end. – thenoone Nov 01 '12 at 18:53
  • I think you need to hack the pan/zoom tools to dynamically load the data from the file, or you need to down-sample the image. In addition to your data, I am pretty sure that imshow stores a 3xNxM array of the RGB values, so your memory footprint really explodes. – tacaswell Nov 01 '12 at 19:52
  • also, your loops are a little screwy; you shouldn't be doing i += 1 in a for loop over i. – tacaswell Nov 01 '12 at 19:54
  • and you can install additional packages anywhere on your system and then add that directory to your pythonpath. – tacaswell Nov 01 '12 at 19:54
  • Also, you have 262k samples in one direction; if you plotted those at one pixel per sample on a 300 ppi display, the display would need to be 873 in wide to see every pixel. – tacaswell Nov 01 '12 at 19:59
  • What exactly are you plotting (as in, what type of graph are you producing?)? I think you'll find you get better help if you describe *what* you're trying to do, rather than *how* you want to do it. – Chris Pfohl Nov 01 '12 at 20:05
  • Trying to plot the raw data in this type of situation is pointless. If you take, say, a mere 1 millisecond to look at each pixel, it will take you ~350 hours to examine your data. Instead, you should try to pre-process your data to extract the various features that you're interested in, and view them in a simplified way that the human brain can handle. – tom10 Nov 01 '12 at 20:44
  • @tom10 It's interesting how we need those kinds of comparisons to understand big numbers... I made similar remarks yesterday in another question (with little success, despite it being about 7 TiB instead of 5 GiB...). See http://stackoverflow.com/questions/13143052/iterate-two-or-more-lists-numpy-arrays-and-compare-each-item-with-each-othe/13143382#comment17931433_13143382 – jorgeca Nov 02 '12 at 10:39
  • Hi guys, the data I am plotting are radio astronomical observations, and the file I have contains 337 rows and 262144 columns. The rows are the time slices and the columns are the frequency channels. I am later performing a 2D FFT on this data to get another image; in that image the length of the y axis is directly proportional to the number of channels I have, so I need 262144 to get the minimum time delay I am interested in. I hope it is more clear now why I need the data to be that long. I am only planning to examine the whole dataset for abnormalities before FFTing. – thenoone Nov 02 '12 at 12:06
  • So I am not going to plot them on a big piece of paper or have them on 40 40" monitors. I will need to find, if any, regions of the plot that are interesting and then zoom in to them. Thank you for taking time to read my question and give feedback. – thenoone Nov 02 '12 at 12:10
  • I think I can solve this problem if someone helps me with how to plot two parts of an array in one figure next to each other, so that I can load 4 lines out of the 337, plot them, then another 4 and plot them contiguously next to the previous ones. – thenoone Nov 02 '12 at 12:17
  • The two parts of the array would have to be stored in different images, and in the end all images need to be next to each other in one figure. – thenoone Nov 02 '12 at 12:24
  • Thanks for the input, I now see your actual issue (which is simply how to plot several arrays in the same figure with no overlap, right?). Could you try distilling your question so that it becomes more clear and thus more useful for other people? – jorgeca Nov 02 '12 at 13:58

3 Answers


If you plot any array with more than ~2k pixels across, something in your graphics chain will downsample the image in some way to display it on your monitor. I would recommend downsampling in a controlled way, something like:

data = convert_raw_data_to_fft(args)  # make sure data is row major
def ds_decimate(row, step=100):
    return row[::step]
def ds_sum(row, step=100):
    return np.sum(row[:step * (len(row) // step)].reshape(-1, step), 1)
# as per suggestion from tom10 in comments
def ds_max(row, step=100):
    return np.max(row[:step * (len(row) // step)].reshape(-1, step), 1)
data_plotable = [ds_sum(d) for d in data]  # plug in whichever function you want

or interpolation.
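Since the goal here is spotting abnormalities, the `ds_max` variant is the safest reducer: a block-wise max cannot wash out a narrow spike the way decimation or summation can. A quick self-contained check with synthetic data (not the OP's file):

```python
import numpy as np

def ds_max(row, step):
    # block-wise max: keep the largest value in each run of `step` samples
    return np.max(row[:step * (len(row) // step)].reshape(-1, step), 1)

row = np.zeros(262144)      # one frequency row, same length as in the question
row[100000] = 5.0           # a single-channel spike
reduced = ds_max(row, 1024)
print(reduced.shape)        # (256,) -- easily plottable
print(reduced.max())        # 5.0 -- the spike survives the reduction
```

With `ds_decimate` the same spike would almost certainly be skipped, and with `ds_sum` it would be diluted by a factor of the block size.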

tacaswell
  • +1: for dealing with down-sampling in a controlled way. Personally, I'd plot the envelopes (that is, say the max and min of every set of 10000 pts) and maybe the envelopes of the diff, since OP is looking for outliers, but yours is the right basic approach, imho. – tom10 Nov 02 '12 at 15:52

Matplotlib is pretty memory-inefficient when plotting images. It creates several full-resolution intermediate arrays, which is probably why your program is crashing.

One solution is to downsample the image before feeding it into matplotlib, as @tcaswell suggests.

I also wrote some wrapper code to do this downsampling automatically, based on your screen resolution. It's at https://github.com/ChrisBeaumont/mpl-modest-image, if it's useful. It also has the advantage that the image is resampled on the fly, so you can still pan and zoom without sacrificing resolution where you need it.
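The core of that idea can also be sketched without any extra package: pick a target on-screen width and derive the block size from it. This is a minimal sketch, not ModestImage's API; the function name and the 2000 px target are my choices.

```python
import numpy as np

def downsample_to_width(img, target_px=2000):
    # derive the block size from the desired on-screen width,
    # then take the max over each block so outliers stay visible
    step = max(1, img.shape[1] // target_px)
    ncols = step * (img.shape[1] // step)   # drop the ragged tail
    return img[:, :ncols].reshape(img.shape[0], -1, step).max(axis=2)

img = np.random.rand(12, 262144)   # 12 rows, as in the question's sample
small = downsample_to_width(img)
print(small.shape)                 # (12, 2001): step comes out as 131
```

The reduced array is small enough for imshow to handle comfortably; for real pan/zoom support you still want the on-the-fly resampling the linked package provides.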

ChrisB

I think you're just missing the extent=(left, right, bottom, top) keyword argument in plt.imshow.

import numpy as np
import matplotlib.pyplot as plt

x = np.random.randn(2, 10)
y = np.ones((4, 10))
x[0] = 0  # To make it clear which side is up, etc
y[0] = -1

plt.imshow(x, extent=(0, 10, 0, 2))
plt.imshow(y, extent=(0, 10, 2, 6))
# This is necessary, else the plot gets rescaled and only shows the last array
plt.ylim(0, 6)
plt.colorbar()
plt.show()
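Applied to the question's batching loop, each chunk gets its own x offset via `extent`, so the batches tile instead of piling up at x = 0. This is a shrunken stand-in sketch (12 rows and 1024 channels instead of 337 and 262144; the Agg backend and the `tiled.png` filename are my choices for a non-interactive example):

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')              # headless backend for this sketch
import matplotlib.pyplot as plt

rows, freqs, batch = 12, 1024, 4   # stand-ins for 337 and 262144
data = np.random.rand(freqs, rows) # transposed, as in the question

fig, ax = plt.subplots()
for o in range(0, rows, batch):
    # extent=(left, right, bottom, top) pins this batch to x = o..o+batch
    ax.imshow(data[:, o:o + batch], aspect='auto', origin='lower',
              cmap='gray_r', interpolation='nearest',
              extent=(o, o + batch, 0, freqs))
ax.set_xlim(0, rows)               # undo the autoscaling to the last batch
ax.set_ylim(0, freqs)
fig.savefig('tiled.png')
```

Each `imshow` call here only needs one batch in memory at a time, which is the point: combined with down-sampling per batch, the full-resolution array never has to exist at once.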

(figure: the two example arrays rendered as stacked images sharing one x axis)

jorgeca