I have a file of 4.5MB (9,223,136 lines) with the following information:

0       0
0.0147938       3.67598e-07
0.0226194       7.35196e-07
0.0283794       1.10279e-06
0.033576        1.47039e-06
0.0383903       1.83799e-06
0.0424806       2.20559e-06
0.0465545       2.57319e-06
0.0499759       2.94079e-06

Each column represents a value from 0 to 100, i.e. a percentage. My goal is to draw a graph in ggplot2 to check the percentages against each other (e.g. at 20% in column 1, what percentage is reached in column 2). Here is my R script:

library(ggplot2)
dataset=read.table("~/R/datasets/cumul.txt.gz")
p <- ggplot(dataset,aes(V2,V1))
p <- p + geom_line()
p <- p + scale_x_continuous(formatter="percent") + scale_y_continuous(formatter="percent")
p <- p + theme_bw()
ggsave("~/R/grafs/cumul.png")

I'm having a problem: every time I run this, R runs out of memory with the error "Cannot allocate vector of size 128.0 Mb". I'm running 32-bit R on a Linux machine and have about 4 GB of free memory.

I thought of a workaround: reduce the precision of these values (by rounding them) and eliminate duplicate lines, so that the dataset has fewer lines. Could you give me some advice on how to do this?
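
Something along these lines is roughly what I have in mind (untested on the full file; three decimal places is just an arbitrary choice):

dataset <- read.table("~/R/datasets/cumul.txt.gz")
dataset <- round(dataset, 3)    # reduce precision to 3 decimal places
dataset <- unique(dataset)      # drop the duplicate rows created by rounding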

Barata
  • Do you have any other objects in your workspace? When this happens to me, I normally save the objects I need, and then restart with a clean workspace. Calling gc() may also help if you remove unwanted objects. – richiemorrisroe Jun 01 '11 at 11:03
  • @richie I start a clean R session calling only this script: `/usr/bin/R CMD BATCH --vanilla --no-timing ~/scripts/R/grafs/cumul.R ~/R/scripts_output/cumul.txt ` – Barata Jun 01 '11 at 11:26
  • does the script only create the example above or does it create more objects? – richiemorrisroe Jun 01 '11 at 11:28
  • @richiemorrisroe I tried running it (with dummy data and a clean workspace) on 8 GB RAM, 64-bit Windows. The process reached 6 GB when I killed it. So the only way is to limit the size of the data. – Marek Jun 01 '11 at 12:17

1 Answer

Are you sure you have 9 million lines in a 4.5 MB file (edit: perhaps your file is 4.5 GB??)? It must be heavily compressed -- when I create a file one tenth that size, it's 115 MB ...

n <- 9e5                                  # one tenth of the ~9 million lines
set.seed(1001)
z <- rnorm(n)
z <- cumsum(z)/sum(z)                     # cumulative sums, normalized so the last value is 1
d <- data.frame(V1 = seq(0, 1, length = n), V2 = z)
ff <- gzfile("lgfile2.gz", "w")           # gzip-compressed output connection
write.table(d, row.names = FALSE, col.names = FALSE, file = ff)
close(ff)
file.info("lgfile2.gz")["size"]           # size of the compressed file in bytes

It's hard to tell from the information you've given what kind of "duplicate lines" you have in your data set ... unique(dataset) will extract just the unique rows, but that may not be useful. I would probably start by simply thinning the data set by a factor of 100 or 1000:

smdata <- dataset[seq(1, nrow(dataset), by = 1000), ]   # keep every 1000th row

and see how it goes from there. (edit: forgot a comma!)

Graphical representations of large data sets are often a challenge. In general you will be better off:

  • summarizing the data somehow before plotting it (see the sketch after this list)
  • using a specialized graphical type (density plots, contours, hexagonal binning) that reduces the data
  • using base graphics, which uses a "draw and forget" model (unless graphics recording is turned on, e.g. in Windows), rather than lattice/ggplot/grid graphics, which save a complete graphical object and then render it
  • using raster or bitmap graphics (PNG etc.), which only record the state of each pixel in the image, rather than vector graphics, which save all objects whether they overlap or not
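
For example, a rough sketch combining the first, third, and fourth points, reusing the thinned smdata from above (the output file name and dimensions are arbitrary):

## plot the thinned data with base graphics, straight to a PNG (raster) device
png("cumul_base.png", width = 800, height = 600)
plot(smdata$V2, smdata$V1, type = "l", xlab = "V2", ylab = "V1")
dev.off()   # close the device so the file is written
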
Ben Bolker
  • Your first two points are the most important (and are equivalent) - if you summarise the data appropriately, the drawing speed of base vs. grid won't be important. – hadley Jun 01 '11 at 19:55
  • @hadley: fair enough. From a beginning user's point of view there may be a difference between a black-box machine to summarize and plot the data in one step (appropriate use of `sum_stat()`, `plot(density(x))`, etc.) and having to do the summary themselves. I agree that it is more elegant and in the long run more efficient to find ways to summarize the data, although #3 and #4 can be useful to consider when hacking about in the short run. – Ben Bolker Jun 01 '11 at 20:04