It's pretty easy to build a nice huge scatterplot matrix with histograms down the diagonal for multivariate data as follows:

scatterplotMatrix(somedata[1:points.count,],groups=somedata[1:points.count,class],
                by.groups=TRUE,diagonal="histogram")

According to the documentation though, it doesn't seem possible to divide up the histogram by the group labels as is done in this question. How would you do that using scatterplotMatrix or a similar function?

bright-star
  • Check out the `GGally` package and its `ggpairs()` function. [This question also has an interesting solution](http://stackoverflow.com/questions/11503902/colouring-ggplots-plotmatrix-by-k-means-clusters); look for the answer with the `plotmatrix2` function. – marbel Jan 12 '14 at 23:52

2 Answers

Is this what you had in mind?

Using the iris dataset:

library(ggplot2)
library(data.table)
library(reshape2)  # for melt(...)
library(plyr)      # for .(...)

xx <- with(iris, data.table(id=1:nrow(iris), group=Species, 
           Sepal.Length, Sepal.Width,Petal.Length, Petal.Width))
# reshape for facetting with ggplot
yy <- melt(xx,id=1:2, variable.name="H", value.name="xval")
yy <- data.table(yy,key="id,group")
ww <- yy[,list(V=H,yval=xval),keyby="id,group"]
zz <- yy[ww,allow.cartesian=TRUE]
setkey(zz,H,V,group)
zz <- zz[,list(id, group, xval, yval, min.x=min(xval), min.y=min(yval),
               range.x=diff(range(xval)),range.y=diff(range(yval))),by="H,V"]
# points colored by group (=species)
# density plots for each variable by group
d  <-  zz[H==V, list(x=density(xval)$x,
          y=mean(min.y)+mean(range.y)*density(xval)$y/max(density(xval)$y)),
          by="H,V,group"]
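ggp = ggplot(zz)
# off-diagonal facets: points colored by group
# (current ggplot2 no longer accepts subset=, so the rows are filtered via data=)
ggp = ggp + geom_point(data = zz[H != V],
                       aes(x=xval, y=yval, color=factor(group)), 
                       size=3, alpha=0.5)
# diagonal facets: density curves by group (d holds only the H == V rows)
ggp = ggp + geom_line(data=d, aes(x=x, y=y, color=factor(group)))
ggp = ggp + facet_grid(V~H, scales="free")
ggp = ggp + scale_color_discrete(name="Species")
ggp = ggp + labs(x="", y="")
ggp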

I keep hearing that the same thing is possible using ggpairs(...) in package GGally. I would love to see an actual example of it. The documentation is inscrutable. Also, ggpairs(...) is extremely slow (in my hands), especially with large datasets.
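
For comparison, here is a minimal base-graphics sketch (not part of the ggplot approach above) that puts per-group histograms, rather than densities, on the diagonal by passing a custom `diag.panel` to `pairs()`. The helper name `panel_group_hist`, the colours, and the use of `iris` as stand-in data are all assumptions made for illustration.

```r
# Assumption: iris stands in for the real data; `grp` is the grouping factor.
grp  <- iris$Species
cols <- c("red", "green3", "blue")   # one colour per group (arbitrary choice)

# Hypothetical helper: overlay one histogram per group in a diagonal panel.
panel_group_hist <- function(x, ...) {
  usr <- par("usr")
  on.exit(par(usr = usr))                       # restore the panel coordinates
  breaks <- hist(x, plot = FALSE)$breaks        # bins shared by all groups
  counts <- sapply(levels(grp), function(g)
    hist(x[grp == g], breaks = breaks, plot = FALSE)$counts)
  # Rescale the y axis so the tallest bar fits, leaving room for the label.
  par(usr = c(usr[1:2], 0, 1.5 * max(counts)))
  for (j in seq_along(levels(grp))) {
    rect(breaks[-length(breaks)], 0, breaks[-1], counts[, j],
         col = adjustcolor(cols[j], alpha.f = 0.4), border = cols[j])
  }
}

pairs(iris[1:4], col = cols[grp], pch = 19, cex = 0.6,
      diag.panel = panel_group_hist)
```

The shared `breaks` keep the bars of different groups aligned, and the semi-transparent fill lets overlapping groups stay visible.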

jlhoward
  • It's beautiful. That's what I was looking for. – bright-star Jan 13 '14 at 05:35
  • Yeah, I'm sorry I haven't gotten to it yet. My policy is to at least *try* out solutions before I accept, although I upvote on the spot. – bright-star Jan 15 '14 at 07:00
  • This is pretty hard on memory, huh? With that cartesian product, you get N*(M-1) points (where M is the number of feature types)? – bright-star Jan 18 '14 at 06:55
  • @TrevorAlexander - N*M^2, actually. This is why data.tables are used for the joins. In practical terms: with, say, 1e5 points and 10 feature types, `zz` still has only 1e7 rows. Consider what this would look like: a 10 x 10 matrix of tiny facets, each containing 100,000 points. – jlhoward Jan 18 '14 at 19:01
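
The row count from that last comment can be checked with a couple of lines of arithmetic. This is only a sketch; `rows_in_zz` is a made-up helper name, not something from the answer.

```r
# Every point appears once per (H, V) facet, and there are M * M facets.
rows_in_zz <- function(n_points, n_features) n_points * n_features^2

rows_in_zz(150, 4)    # iris: 150 points, 4 measurements -> 2400
rows_in_zz(1e5, 10)   # the example from the comment      -> 1e+07
```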

For later reference, the GGally way to do it is as follows:

library(GGally)      # ggpairs() is a function in the GGally package
library(data.table)  # for data.table(...)
tmp <- data.table(a = runif(30),b = runif(30), c = runif(30)+1, 
                  d = as.factor(sample(0:1,size=30, replace=TRUE)))

ggpairs(data=tmp, diag=list(continuous="density"), columns=1:3, colour="d",
        axisLabels="show")

pairwise scatterplot matrix with group densities on diagonal

This intrepid asker figured out that you have to enable axisLabels which is somewhat silly, given the aesthetic emphasis of ggplot and friends.
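
If the `ggpairs()` call above fails with a recent GGally, the likely cause is an interface change: as far as I know, from GGally 1.0 onward the `colour=` argument is gone and grouping is passed through `mapping`, and the diagonal density panel is called `"densityDiag"`. A sketch under those assumptions, using a plain data frame and a fixed seed (both my own choices):

```r
library(ggplot2)  # for aes()
library(GGally)

set.seed(1)  # reproducible random data
tmp <- data.frame(a = runif(30), b = runif(30), c = runif(30) + 1,
                  d = factor(sample(0:1, size = 30, replace = TRUE)))

# Assumption: GGally >= 1.0, where grouping goes through `mapping`
# and the diagonal density panel is named "densityDiag".
p <- ggpairs(tmp, mapping = aes(colour = d), columns = 1:3,
             diag = list(continuous = "densityDiag"),
             axisLabels = "show")
p
```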

Now I want to know how to parallelize this, because it's a monster with high numbers of variables.

bright-star