Plotting of very large data sets in R

Question

How can I plot a very large data set in R?

I'd like to use a boxplot, violin plot, or similar. The data cannot all fit in memory. Can I read it in incrementally and calculate the summaries needed to make these plots? If so, how?

@radek : "All the data cannot be fit in memory" seems like a good approximation of "far too large for R to handle". Whether it is 2Gb or 20Gb doesn't really matter any more, does it? — Joris Meys, Dec 03 '10 at 14:23
@Joris Unless OP has memory.limit too small or many needless columns or something else. This information could be relevant. — Marek, Dec 03 '10 at 16:38
@Marek : could be, but I'd like to assume that OP knows what he's doing. And I do believe in the tooth fairy too. — Joris Meys, Dec 03 '10 at 17:14
As people have assumed, "far too large to fit in memory". I was specifically looking for something that will work on 2 GB, 20 GB, or 200 GB. Doesn't need to be efficient. — Daniel Arndt, Dec 04 '10 at 16:27
"Very large" should probably mean 64+ GB. Below that you can just run an Amazon EC2 instance with ~64 GB RAM. — mrsteve, Dec 05 '10 at 07:11
@mrsteve The exact size of "very large" is impossible to define, as it is relative. It makes no sense to put a discrete line where a dataset becomes "very large". The term was used in order to request a general answer for which the statement 'all the data cannot be fit in memory' was used to identify the concrete restrictions. I purposely did not define the exact size as it is irrelevant. The dataset could be 32 MB, and if my machine has 16 MB ram, that dataset is very large relative to the physical hardware available to me. — Daniel Arndt, Dec 06 '10 at 16:56

score 10 · Accepted Answer · edited May 23 '17 at 10:30


To supplement my comment on Dmitri's answer, here is a function that calculates quantiles using the ff big-data handling package:

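library(ff)  # provides ffsort() and disk-backed ff vectors

ffquantile<-function(ffv,qs=c(0,0.25,0.5,0.75,1),...){
 stopifnot(all(qs<=1 & qs>=0))
 # sort on disk; extra arguments are passed on to ffsort()
 ffsort(ffv,...)->ffvs
 # fractional 1-based positions of the requested quantiles
 j<-(qs*(length(ffv)-1))+1
 jf<-floor(j);ceiling(j)->jc
 # average the two order statistics that bracket each position
 rowSums(matrix(ffvs[c(jf,jc)],length(qs),2))/2
}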

This is an exact algorithm, so it uses sorting -- and thus may take a lot of time.
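As a rough check of the index arithmetic above, the same calculation can be run on an ordinary in-memory vector. This is only a sketch: `mem_quantile` is a made-up name (not part of ff), and `sort()` stands in for `ffsort()`. It also shows that averaging the two neighbouring order statistics is a slightly different quantile definition from the linear interpolation that `quantile()` uses by default.

```r
# In-memory stand-in for ffquantile(): same index arithmetic, base R only.
# `mem_quantile` is a hypothetical name used for illustration.
mem_quantile <- function(v, qs = c(0, 0.25, 0.5, 0.75, 1)) {
  stopifnot(all(qs >= 0 & qs <= 1))
  vs <- sort(v)
  j <- qs * (length(v) - 1) + 1          # fractional 1-based positions
  (vs[floor(j)] + vs[ceiling(j)]) / 2    # average the two neighbours
}

x <- c(10, 20, 30, 40)
res <- mem_quantile(x)
print(res)                               # 10 15 25 35 40

# quantile() interpolates linearly instead, so the quartiles differ a little
ref <- unname(quantile(x))
print(ref)                               # 10.0 17.5 25.0 32.5 40.0
```

On large data the two definitions agree closely, because neighbouring order statistics are nearly equal.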

answered Dec 03 '10 at 19:47

mbq


looks like you are trying to achieve some sort of <-...-> symmetry in your code ;) – VitoshKa Dec 03 '10 at 21:33
Thank you, I'll give this a shot. I suspect it will take time, but that's what servers are for ;) in the meantime I'll try sampling as Joris Meys had suggested – Daniel Arndt Dec 04 '10 at 16:30

Joris Meys · Answer 2 · 2010-12-03T18:24:01.577

The problem is that you can't load all the data into memory. So you could sample the data, as indicated earlier by @Marek. On such huge datasets, you get essentially the same results even if you take only 1% of the data. For the violin plot, this will give you a decent estimate of the density. Exact progressive calculation of quantiles is impossible without keeping the data, but sampling should give a very decent approximation. It is essentially the same as the "randomized method" described in the link @aix gave.
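To get a feel for how close a 1% sample comes, here is a small simulation. It is a sketch under an obvious assumption: the "huge" data set is simulated standard normal data that does fit in memory, so the full-data quantiles are available for comparison.

```r
set.seed(42)                                     # fixed seed so the run is repeatable
full <- rnorm(1e6)                               # stand-in for the "huge" data set
samp <- sample(full, size = length(full) / 100)  # 1% simple random sample

probs  <- c(0.25, 0.5, 0.75)
q_full <- quantile(full, probs)
q_samp <- quantile(samp, probs)
max_diff <- max(abs(q_full - q_samp))            # worst quartile discrepancy
print(rbind(full = q_full, sample = q_samp))

# The kernel density of the sample is what a violin plot would draw
d <- density(samp)
```

How well this works on real data depends on its distribution; heavy tails or rare groups need a larger sample.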

If you can't subset the data outside of R, it can be done using connections in combination with sample(). The following function is what I use to sample data from a data frame stored in text format when it gets too big. If you play a bit with the connection, you could easily convert this to a socketConnection or another connection type to read from a server, a database, whatever. Just make sure you open the connection in the correct mode.

Take a simple .csv file; the following function then samples a fraction p of the data:

sample.df <- function(f,n=10000,split=",",p=0.1){
    con <- file(f,open="rt")
    on.exit(close(con))
    y <- data.frame()
    #read header, skipping any leading empty lines
    x <- character(0)
    while(length(x)==0){
      x <- strsplit(readLines(con,n=1),split)[[1]]
    }
    Names <- x
    #read and process data in blocks of n lines
    repeat{
      #read.table errors at end of file, which ends the loop
      x <- tryCatch(read.table(con,nrows=n,sep=split),error = function(e) NULL )
      if(is.null(x)) {break}
      names(x) <- Names
      nn <- nrow(x)
      #sample a fraction p of the rows in this block
      id <- sample(seq_len(nn),round(nn*p))
      y <- rbind(y,x[id,,drop=FALSE])
    }
    rownames(y) <- NULL
    return(y)
}

An example of the usage:

#Make a file
Df <- data.frame(
  X1=1:10000,
  X2=1:10000,
  X3=rep(letters[1:10],1000)
)
write.csv(Df,file="test.txt",row.names=FALSE,quote=FALSE)

# n is number of lines to be read at once, p is the fraction to sample
DF2 <- sample.df("test.txt",n=1000,p=0.2)
str(DF2)

#clean up
unlink("test.txt")

I had an algorithm implemented, but it was so incredibly slow when tried on a real dataset that I deleted it again. It doesn't gain a thing, and in any case "blocked" sampling, as the function sample.df does, is by far the best approach when we're talking about sampling without distorting the distribution. — Joris Meys, Dec 03 '10 at 18:25
This was very useful in the end. Thank you very much for your help, Joris Meys. — Daniel Arndt, Dec 07 '10 at 08:07

score 4 · Answer 3 · answered Dec 03 '10 at 02:56


All you need for a boxplot are the quantiles, the "whisker" extremes, and the outliers (if shown), all of which are easily precomputed. Take a look at the boxplot.stats function.
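A minimal sketch of that idea: `boxplot.stats()` returns the summaries, and `bxp()` draws a boxplot from summaries alone. The two simulated groups below are placeholders; in practice the summaries could come from a sample or from a database.

```r
set.seed(1)
# Stand-in for data summarised elsewhere (e.g. on a sample or in a database)
groups <- list(a = rnorm(5000), b = rnorm(5000, mean = 1))

st <- lapply(groups, boxplot.stats)      # stats, n, conf, out for each group

# Assemble the list structure that bxp() expects (what boxplot() returns)
z <- list(
  stats = sapply(st, `[[`, "stats"),     # 5 x k: whisker, hinge, median, hinge, whisker
  n     = sapply(st, `[[`, "n"),
  conf  = sapply(st, `[[`, "conf"),      # 2 x k matrix of notch limits
  out   = unlist(lapply(st, `[[`, "out"), use.names = FALSE),
  group = rep(seq_along(st), times = sapply(st, function(s) length(s$out))),
  names = names(st)
)
bxp(z)                                   # draws the plot without the raw data
```

Only `z` has to be in memory at plotting time, and it is tiny compared with the raw data.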

Dmitri


But it's not possible to compute them exactly without loading all data into memory. – hadley Dec 03 '10 at 04:54

@hadley No, `ff` package allows you to count quantiles as usual but on data stored partially on hard drive. – mbq Dec 03 '10 at 11:41
@mbq : out of curiosity: which function in ff would do that? I saw ff especially as an interface to efficiently store large data, mostly in combination with genomics. But I can be totally wrong. – Joris Meys Dec 03 '10 at 14:03

@Joris -- good point; I thought it was there, but now I see I was wrong. Yet it is still possible to write such thing. – mbq Dec 03 '10 at 15:47

@Joris I wrote a function actually calculating quantiles using ff; comment is too limited, so I posted it as an answer. – mbq Dec 03 '10 at 19:41

score 4 · Answer 4 · answered Dec 03 '10 at 17:55

You should also look at the RSQLite, SQLiteDF, RODBC, and biglm packages. For large datasets it can be useful to store the data in a database and pull only pieces into R. The databases can also do the sorting for you, and computing quantiles on sorted data is much simpler (then just use the quantiles to do the plots).
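As an illustration of letting the database sort, the sketch below assumes the DBI and RSQLite packages are installed. The table name `measurements`, the column `value`, and the helper `db_quantile` are invented for the example, and an in-memory database stands in for a real on-disk one.

```r
# Assumes the DBI and RSQLite packages are installed; names are made up.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")  # use a file path for real data
set.seed(7)
x <- rnorm(1001)
dbWriteTable(con, "measurements", data.frame(value = x))

n <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM measurements")$n

# Let the database sort; fetch only the single row at each quantile position
db_quantile <- function(con, p, n) {
  offset <- as.integer(round(p * (n - 1)))       # 0-based row in sorted order
  sql <- sprintf(
    "SELECT value FROM measurements ORDER BY value LIMIT 1 OFFSET %d", offset)
  dbGetQuery(con, sql)$value
}

qs <- sapply(c(0, 0.25, 0.5, 0.75, 1), db_quantile, con = con, n = n)
print(qs)
dbDisconnect(con)
```

Each query returns one value, so R never holds more than a handful of numbers. An index on the column would let the database avoid re-sorting for every query.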

There is also the hexbin package (Bioconductor) for doing scatterplot equivalents with very large datasets (you probably still want to use a sample of the data, but it works with a large sample).

score 4 · Answer 5 · answered Dec 03 '10 at 18:02


You could put the data into a database and calculate the quantiles using SQL. See : http://forge.mysql.com/tools/tool.php?id=149

Matti Pastell


NPE · Answer 6 · 2010-12-03T10:18:00.240

This is an interesting problem.

Boxplots require quantiles. Computing quantiles on very large datasets is tricky.

The simplest solution that may or may not work in your case is to downsample the data first, and produce plots of the sample. In other words, read a bunch of records at a time, and retain a subset of them in memory (choosing either deterministically or randomly.) At the end, produce plots based on the data that's been retained in memory. Again, whether or not this is viable very much depends on the properties of your data.
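One way to do the random variant in a single pass is reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample without knowing the number of records in advance. The sketch below is illustrative only: `reservoir_sample` is a made-up name, and it assumes a text file with exactly one numeric value per line and no header.

```r
# Reservoir sampling (Algorithm R): keep k values chosen uniformly at random
# from a stream of unknown length, reading the file chunk by chunk.
# Assumes one numeric value per line and no header (illustrative only).
reservoir_sample <- function(path, k, chunk_lines = 10000) {
  con <- file(path, open = "rt")
  on.exit(close(con))
  reservoir <- numeric(0)
  seen <- 0                                  # values processed so far
  repeat {
    lines <- readLines(con, n = chunk_lines)
    if (length(lines) == 0) break            # end of file
    for (v in as.numeric(lines)) {
      seen <- seen + 1
      if (seen <= k) {
        reservoir[seen] <- v                 # fill the reservoir first
      } else {
        j <- sample.int(seen, 1)             # uniform position in 1..seen
        if (j <= k) reservoir[j] <- v        # replace with probability k/seen
      }
    }
  }
  reservoir
}

tmp <- tempfile(fileext = ".txt")
writeLines(as.character(1:50000), tmp)       # toy file standing in for big data
set.seed(123)
s <- reservoir_sample(tmp, k = 1000)
unlink(tmp)
boxplot(s, main = "Boxplot from a reservoir sample")
```

Memory use is bounded by `k` plus one chunk of lines, regardless of the file size.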

Alternatively, there exist algorithms that can economically and approximately compute quantiles in an "online" fashion, meaning that they are presented with one observation at a time, and each observation is shown exactly once. While I have some limited experience with such algorithms, I have not seen any readily-available R implementations.
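To make the "online" idea concrete, here is a deliberately simple approximation: a fixed-bin histogram that is updated chunk by chunk, with quantiles read off the cumulative counts afterwards. This is a sketch, not one of the published streaming algorithms. The function names are invented, and it assumes the range of the data is known beforehand (values outside the range are counted in the end bins).

```r
# Fixed-bin histogram accumulated over chunks; quantiles are read off the
# cumulative counts. The error is bounded by the bin width. Assumes the
# data range [lo, hi] is known in advance (illustrative sketch).
make_hist_sketch <- function(lo, hi, bins = 1000) {
  list(breaks = seq(lo, hi, length.out = bins + 1), counts = numeric(bins))
}

update_sketch <- function(sk, chunk) {
  # all.inside = TRUE clamps out-of-range values into the first/last bin
  idx <- findInterval(chunk, sk$breaks, rightmost.closed = TRUE, all.inside = TRUE)
  sk$counts <- sk$counts + tabulate(idx, nbins = length(sk$counts))
  sk
}

sketch_quantile <- function(sk, probs) {
  cum <- cumsum(sk$counts)
  sapply(probs, function(p) {
    b <- which(cum >= p * cum[length(cum)])[1]  # first bin reaching the target
    (sk$breaks[b] + sk$breaks[b + 1]) / 2       # report the bin midpoint
  })
}

set.seed(99)
sk <- make_hist_sketch(-6, 6)
for (i in 1:20) sk <- update_sketch(sk, rnorm(50000))  # each chunk is seen once
approx_q <- sketch_quantile(sk, c(0.25, 0.5, 0.75))
print(approx_q)
```

Only the bin counts are kept between chunks, so memory use does not grow with the number of observations. The price is that accuracy depends on choosing a sensible range and bin count.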

The following paper presents a brief overview of some relevant algorithms: Quantiles on Streams.

score 2 · Answer 7 · answered Dec 03 '10 at 09:59

You could make plots from a manageable sample of your data. E.g., if you use only 10% of the rows, chosen at random, then a boxplot of this sample shouldn't differ much from the all-data boxplot.

If your data are in a database, you should be able to create a random flag there (as far as I know, almost every database engine has some kind of random number generator).

The second question is how large your dataset is. For a boxplot you need two columns: a value variable and a group variable. This example:

N <- 1e6
x <- rnorm(N)
b <- sapply(1:100, function(i) paste(sample(letters,40,TRUE),collapse=""))
g <- factor(sample(b,N,TRUE))
boxplot(x~g)

needs 100MB of RAM. If N=1e7 then it uses <1GB of RAM (which is still manageable on a modern machine).

score 1 · Answer 8 · answered Nov 01 '18 at 22:31


Perhaps you can think about using disk.frame to summarise the data down first before running the plotting?

xiaodai


score 0 · Answer 9 · answered Aug 07 '22 at 00:59

The problem with R (and other languages like Python and Julia) is that you have to load all your data into memory to plot it. As of 2022, the best solution is to use DuckDB (there is an R connector). It lets you query very large datasets (CSV, Parquet, among others) and comes with many functions for computing summary statistics. The idea is to use DuckDB to compute those statistics, load them into R/Python/Julia, and plot.

Computing a boxplot with SQL + R

You need a bunch of statistics to plot a boxplot. If you want a complete reference, you can look at matplotlib's code. The code is in Python, but it is pretty straightforward, so you'll get it even if you don't know Python.

The most critical pieces are the percentiles; you can compute those in DuckDB like this (just change the placeholders):

SELECT
percentile_disc(0.25) WITHIN GROUP (ORDER BY "{{column}}") AS q1,
percentile_disc(0.50) WITHIN GROUP (ORDER BY "{{column}}") AS med,
percentile_disc(0.75) WITHIN GROUP (ORDER BY "{{column}}") AS q3,
AVG("{{column}}") AS mean,
COUNT(*) AS N
FROM "{{path/to/data.parquet}}"

You need some other statistics to create the boxplot with all its details. For a full implementation, check this (note: it's written in Python). I had to implement this for a package I wrote called JupySQL, which allows plotting very large datasets in Jupyter by leveraging SQL engines such as DuckDB.

Once you compute the statistics, you can use R to generate the boxplot.
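A rough end-to-end sketch of that last step, assuming the DBI and duckdb packages are installed. The table name `measurements` and column `value` are placeholders, and a small registered data frame stands in for a real Parquet or CSV file. For simplicity the whiskers here span the minimum and maximum rather than the usual 1.5 × IQR rule, and no outliers are drawn.

```r
# Assumes the DBI and duckdb packages are installed; names are placeholders.
library(DBI)

con <- dbConnect(duckdb::duckdb())           # in-memory DuckDB instance
set.seed(5)
# Stand-in table; with real data, query the Parquet/CSV file path instead
duckdb::duckdb_register(con, "measurements", data.frame(value = rnorm(10000)))

s <- dbGetQuery(con, "
  SELECT
    MIN(value) AS lo,
    percentile_disc(0.25) WITHIN GROUP (ORDER BY value) AS q1,
    percentile_disc(0.50) WITHIN GROUP (ORDER BY value) AS med,
    percentile_disc(0.75) WITHIN GROUP (ORDER BY value) AS q3,
    MAX(value) AS hi,
    COUNT(*) AS n
  FROM measurements
")
dbDisconnect(con, shutdown = TRUE)

# bxp() draws from summaries alone; whiskers simply span min..max here
z <- list(
  stats = matrix(c(s$lo, s$q1, s$med, s$q3, s$hi), ncol = 1),
  n     = as.numeric(s$n),
  out   = numeric(0),                        # no outliers drawn in this sketch
  group = numeric(0),
  names = "value"
)
bxp(z)
```

Only the one-row summary `s` crosses from DuckDB into R, so the size of the underlying file does not affect R's memory use.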
