
I have an input file with about 20 million lines; the file is about 1.2 GB. Is there any way I can plot the data in R? Some of the columns contain categories; most of them are numbers.

I have tried my plotting script with a small subset of the input file, about 800K lines, but even though I have about 8 GB of RAM, I don't seem to be able to plot all the data. Is there any simple way to do this?

Sam

  • What are you hoping to see in a plot with 20 million data points? – Chase May 29 '12 at 20:51
  • Regardless of computational capacity, you're going to have to reduce your data via histograms, 1D and 2D density plots, hexbin plots, ... – Ben Bolker May 29 '12 at 21:14
  • ... continuing along the lines of @Paul Hiemstra's answer below -- if you give some more details about (a subset of) your data you might get an interesting discussion of visualization possibilities going here. Also, `ggplot` might be slower/more memory-hungry than other possibilities, if you really do want to plot every point. – Ben Bolker May 30 '12 at 14:28

5 Answers


Without a clearer description of the kind of plot you want, it is hard to give concrete suggestions. In general, however, there is no need to plot 20 million points. For example, a time series could be represented by a spline fit or by some kind of average, e.g. aggregating hourly data to daily averages. Alternatively, you can draw a subset of the data, e.g. only one point per day in the time series example. So I think your challenge is not so much getting 20M points, or even 800k, onto a plot, but aggregating your data effectively so that it conveys the message you want to tell.
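
A minimal sketch of the daily-aggregation idea, using base R only; the data frame df and its columns timestamp and value are hypothetical stand-ins for the real file:

# Hypothetical example data: one reading per minute for roughly 140 days
df <- data.frame(
  timestamp = Sys.time() + 60 * seq_len(2e5),
  value     = rnorm(2e5)
)

# Aggregate to daily means and plot those instead of every raw point
df$day <- as.Date(df$timestamp)
daily  <- aggregate(value ~ day, data = df, FUN = mean)
plot(daily$day, daily$value, type = "l", xlab = "day", ylab = "daily mean value")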

Paul Hiemstra
  • Sampling the data and repeating the process a few times would also show patterns hidden in the data. – Roman Luštrik May 30 '12 at 09:15
  • I agree with @RomanLuštrik: if the pattern repeats itself for, say, samples of 10,000 points, you know the pattern is constant (aka stationary). If not, a sample of 10,000 points is not sufficient. – Paul Hiemstra May 30 '12 at 14:06
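
A quick sketch of that repeated-sampling check, again with a hypothetical data frame df and columns x and y; if the panels show the same pattern, a 10,000-point sample is probably enough:

# Draw several independent samples of 10,000 rows and compare the plots
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  s <- df[sample(nrow(df), 10000), ]
  plot(s$x, s$y, pch = ".", main = paste("sample", i))
}
par(op)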

The hexbin package, which plots hexagonal bins instead of scatterplots for pairs of variables, as suggested by Ben Bolker in Speed up plot() function for large dataset, worked fairly well for me with 2 million records and 4 GB of RAM. But it failed for 200 million records/rows of the same set of variables. I tried reducing the bin size to trade off computation time against RAM usage, but it did not help.

For 20 million records, you can try hexbins with xbins = 20, 30, or 40 to start with.
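
A minimal sketch of the hexbin approach; x and y here are hypothetical numeric vectors standing in for two columns of the real data:

library(hexbin)

# Hypothetical data: two correlated numeric columns with 2 million values
x <- rnorm(2e6)
y <- x + rnorm(2e6)

# Bin the points into hexagons and plot bin counts instead of raw points
hb <- hexbin(x, y, xbins = 30)   # try xbins = 20, 30, 40 as suggested above
plot(hb, main = "2 million points, hex-binned")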

KarthikS

Plotting directly to a raster file device (by calling png(), for instance) is a lot faster. I tried plotting rnorm(100000): on my laptop, an X11 cairo plot took 2.723 seconds, while the png device finished in 2.001 seconds. With 1 million points, the numbers are 27.095 and 19.954 seconds.

I use Fedora Linux and here is the code.

# f: draw n random points and plot them directly to a PNG (raster) file device
f <- function(n) {
  x <- rnorm(n)
  y <- rnorm(n)
  png('test.png')
  plot(x, y)
  dev.off()
}

# g: draw n random points and plot them on the default (screen) device
g <- function(n) {
  x <- rnorm(n)
  y <- rnorm(n)
  plot(x, y)
}

system.time(f(100000))  # raster (png) device
system.time(g(100000))  # screen device
h2kyeong

Increasing the memory limit with memory.limit() helped me ... this was for plotting nearly 36K records with ggplot.
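
For reference, a minimal sketch of that call; note that memory.limit() works on Windows only, the size is given in MB, and the value below is just an example:

memory.limit()             # report the current limit (Windows only)
memory.limit(size = 16000) # request roughly 16 GB; the number is only an example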

Vidya

Does expanding the available memory with memory.limit(size=2000) (or something bigger) help?

Matthew Plourde