3

I have ~ 5 very large vectors (~ 108 MM entries) so any plot/stuff I do with them in R takes quite long time.

I am trying to visualize their distribution (histogram), and was wondering what would be the best way to superimpose their histogram distributions in R without taking too long. I am thinking to first fit a distribution to the histogram, and then plot all the distribution line fits together in one plot.

Do you have some suggestions on how to do that?

Let us say my vectors are:

x1, x2, x3, x4, x5.

I am trying to use this code: Overlaying histograms with ggplot2 in R

Example of the code I am using for 3 vectors (R fails to do the plot):

n = length(x1)
dat <- data.frame(xx = c(x1, x2, x3),yy = rep(letters[1:3],each = n))
ggplot(dat,aes(x=xx)) + 
    geom_histogram(data=subset(dat,yy == 'a'),fill = "red", alpha = 0.2) +
    geom_histogram(data=subset(dat,yy == 'b'),fill = "blue", alpha = 0.2) +
    geom_histogram(data=subset(dat,yy == 'c'),fill = "green", alpha = 0.2)

but it takes forever to produce the plot, and eventually it kicks me out of R. Any ideas on how to use ggplot2 efficiently for large vectors? Seems to me that I had to create a dataframe, of 5*108MM entries and then plot, highly inefficient in my case.

Thanks!

Community
  • 1
  • 1
Dnaiel
  • 7,622
  • 23
  • 67
  • 126

1 Answers1

20

Here's a little snippet of Rcpp that bins data very efficiently - on my computer it takes about a second to bin 100,000,000 observations:

library(Rcpp)
cppFunction('
  std::vector<int> bin3(NumericVector x, double width, double origin = 0) {
    int bin, nmissing = 0;
    std::vector<int> out;

    NumericVector::iterator x_it = x.begin(), x_end;
    for(; x_it != x.end(); ++x_it) {
      double val = *x_it;
      if (ISNAN(val)) {
        ++nmissing;
      } else {
        bin = (val - origin) / width;
        if (bin < 0) continue;

        // Make sure there\'s enough space
        if (bin >= out.size()) {
          out.resize(bin + 1);
        }
        ++out[bin];
      }
    }

    // Put missing values in the last position
    out.push_back(nmissing);
    return out;
  }
')

x8 <- runif(1e8)
system.time(bin3(x8, 1/100))
#   user  system elapsed 
#  1.373   0.000   1.373 

That said, hist is pretty fast here too:

system.time(hist(x8, breaks = 100, plot = F))
#   user  system elapsed 
#  7.281   1.362   8.669 

It's straightforward to use bin3 to make a histogram or frequency polygon:

# First we create some sample data, and bin each column

library(reshape2)
library(ggplot2)

df <- as.data.frame(replicate(5, runif(1e6)))
bins <- vapply(df, bin3, 1/100, FUN.VALUE = integer(100 + 1))

# Next we match up the bins with the breaks
binsdf <- data.frame(
  breaks = c(seq(0, 1, length = 100), NA),
  bins)

# Then melt and plot
binsm <- subset(melt(binsdf, id = "breaks"), !is.na(breaks))
qplot(breaks, value, data = binsm, geom = "line", colour = variable)

FYI, the reason I had bin3 on hand is that I'm working on how to make this speed the default in ggplot2 :)

hadley
  • 102,019
  • 32
  • 183
  • 245
  • Thanks, looks very fast and nice ;-) love Rcpp wasn't familiar with it. – Dnaiel Dec 01 '12 at 16:20
  • @hadley a small mistake , you need to use doubles quotes to correct the code – agstudy Dec 01 '12 at 16:32
  • Guess we need to add comment embedded contractions to the list of reasons not to use single quotes. – IRTFM Dec 01 '12 at 16:43
  • This is v nice but it does not address the overlaying plots issue, I re-adapted the Q to see if more ideas can come, thanks! – Dnaiel Dec 01 '12 at 16:45
  • @Dnaiel use bin3 to compute the bins yourself, and then plot. – hadley Dec 01 '12 at 16:57
  • @DWin @agstudy doh - that's why I usually stick with separate C++ files & `sourceCpp` – hadley Dec 01 '12 at 16:58
  • @hadley thanks, but how do I produce a plot that overimpose them, color them and use the transparency as in ggplot2, seems ggplot2 only takes dataframes as data... – Dnaiel Dec 01 '12 at 17:09
  • @Dnaiel see my edit. I'd recommend a frequency polygon over transparent histograms. – hadley Dec 01 '12 at 17:45
  • Hm, can someone interpret this error for me: "Error in sourceCpp(code = code, env = env, rebuild = rebuild, showOutput = showOutput, : Error 1 occurred building shared library." – Rico Dec 03 '12 at 11:05
  • I thought that is the error - is there a command to access more? – Rico Dec 03 '12 at 13:44
  • @Rico you should normally get more than that. It may be because you don't have the r development tools installed. If on windows, you need [rtools](http://cran.r-project.org/bin/windows/Rtools/) – hadley Dec 03 '12 at 13:47
  • THanks @hadley - as I haven't heard of the tools before, this must be it. – Rico Dec 04 '12 at 10:04