6

Final implementation - not finished but heading the right way

Idea/Problem: You have a plot with many overlapping points and want to replace them by a plain area, therefore increasing performance viewing the plot.

Possible implementation: Calculate a distance matrix between all points and connect all points below a specified distance.

Todo/Not finished: This currently works for manually set distances depending on size of the printed plot. I stopped here because the outcome didnt meet my aesthetic sense.

Minimal example with intermediate plots

set.seed(074079089)
n.points <- 3000

mat <- matrix(rnorm(n.points*2, 0,0.2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")

d.mat <- dist(mat)
fit.mat <-hclust(d.mat, method = "single")
lims <- c(-1,1)
real.lims <- lims*1.1               ## ggplot invokes them approximately

# An attempt to estimate the point-sizes, works for default pdfs pdf("test.pdf")
cutsize <- sum(abs(real.lims))/100  
groups <- cutree(fit.mat, h=cutsize) # cut tree at height cutsize
# plot(fit.mat) # display dendogram

# draw dendogram with red borders around the 5 clusters
# rect.hclust(fit.mat, h=cutsize, border="red")

library(ggplot2)
df <- data.frame(mat)
df$groups <- groups
plot00 <- ggplot(data=df, aes(x,y, col=factor(groups))) + 
    geom_point() + guides(col=FALSE) +  xlim(lims) + ylim(lims)+ 
    ggtitle("Each color is a group")
pdf("plot00.pdf")
print(plot00)
dev.off()

plot00 - points with group color

# If less than 4 points are connected, show them seperately
t.groups <- table(groups)   # how often which group
drop.group <- as.numeric(names(t.groups[t.groups<4]))   # groups with less than 4 points are taken together
groups[groups %in% drop.group] <- 0                     # in group 0
df$groups <- groups
plot01 <- ggplot(data=df, aes(x,y, col=factor(groups))) + 
    geom_point() + xlim(lims)+ ylim(lims) + 
    scale_color_hue(l=10)
pdf("plot01.pdf")
print(plot01)
dev.off()

plot01 - all single points in one group

find_hull <- function(df_0) 
{
    return(df_0[chull(df_0$x, df_0$y), ])
}


library(plyr)
single.points.df <- df[df$groups == 0 , ]
connected.points.df <- df[df$groups != 0 , ]
hulls <- ddply(connected.points.df, "groups", find_hull) #  for all groups find a hull
plot02 <- ggplot() + 
    geom_point(data=single.points.df, aes(x,y, col=factor(groups))) + 
    xlim(lims)+ ylim(lims) + 
    scale_color_hue(l=10)
pdf("plot02.pdf")
print(plot02)
dev.off()

plot02 - only "single"-points (less than 4 connected points)

plot03 <- plot02
for(grp in names(table(hulls$groups)))
{
    plot03 <- plot03 + geom_polygon(data=hulls[hulls$groups==grp, ],
                                    aes(x,y), alpha=0.4)
}
# print(plot03)
plot01 <- plot01 + theme(legend.position="none")
plot03 <- plot03 + theme(legend.position="none")
# multiplot(plot01, plot03, cols=2)
pdf("plot03.pdf")
print(plot03)
dev.off()

plot03 - final

Initial Question

I have a (maybe odd) question.

In some plots, I have thousands of points in my analysis. To display them, the pc takes quite a bit of time because there are so many points. After now, many of these points can overlap, I have a filled area (which is fine!). To save time/effort displaying, it would be usefull to just fill this area but plotting each point on its own.

I know there are possibilities in heatmaps and so on, but this is not the idea I have in mind. My idea is something like:

#plot00: ggplot with many many points and a filled area of points
plot00 <- plot00 + fill.crowded.areas()

# with plot(), I sadly have an idea how to manage it

Any ideas? Or is this nothing anyone would do anytime?

# Example code
# install.packages("ggplot2")
library(ggplot2)

n.points <- 10000
mat <- matrix(rexp(n.points*2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
df <- data.frame(mat)
plot00 <- ggplot(df, aes(x=x, y=y)) + 
    theme_bw()  +                       # white background, grey strips
    geom_point(shape=19)# Aussehen der Punkte

print(plot00)

ggplot2

# NO ggplot2
plot(df, pch=19)

plot

Edit:
To have density-plots like mentioned by fdetsch (how can I mark the name?) there are some questions concerning this topic. But this is not the thing I want exactly. I know my concern is a bit strange, but the densities make a plot more busy sometimes as necessary.

Links to topics with densities:

Scatterplot with too many points
High Density Scatter Plots

Community
  • 1
  • 1
groebsgr
  • 131
  • 1
  • 14
  • What you ask for is difficult because rendering a solid block depends on the graphical parameters (e.g. point size) that you choose. To create solid areas you need to buffer the points into a single layer in the same way as a Geographical Information System - [this question](https://stackoverflow.com/questions/25411251/buffer-geospatial-points-in-r-with-gbuffer) might help. – geotheory Jan 27 '16 at 12:57

3 Answers3

4

How about using panel.smoothScatter from lattice? It displays a certain number of points in low-density regions (see argument 'nrpoints') and everywhere else, point densities are displayed rather than single (and possibly overlapping) points, thus providing more meaningful insights into your data. See also ?panel.smoothScatter for further information.

## load 'lattice'
library(lattice)

## display point densities
xyplot(y ~ x, data = df, panel = function(x, y, ...) {
  panel.smoothScatter(x, y, nbin = 250, ...)
})

enter image description here

fdetsch
  • 5,239
  • 3
  • 30
  • 58
  • I'm aware of this "density-plotting", it's nice in my opinion, but it's not the thing I have in mind. (sorry :D) - ty nevertheless! – groebsgr Jan 27 '16 at 11:07
1

Do you mean something like the convex hull of your points:

Convex Hull of points

set.seed(1337)
n.points <- 100
mat <- matrix(rexp(n.points*2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
df <- data.frame(mat)
ch <- chull(df$x, df$y) # This computes the convex hull

require(ggplot2)
ggplot() + 
  geom_point(data=df, aes(x,y)) +
  geom_polygon(data = df[ch,], aes(x,y), alpha=0.5)
Rentrop
  • 20,979
  • 10
  • 72
  • 100
  • I tried now for about 20 mins. It could! lead to the result i want. First, i want to replace the points below the polygon. Secondly, how would I find connected areas? Maybe a distance-matrix and only the small distances for a polygon? – groebsgr Jan 27 '16 at 11:45
1

You could use a robust estimator to estimate the location of the majority of your points and plot the convex hull of the points as follows:

set.seed(1337)
n.points <- 500
mat <- matrix(rexp(n.points*2), nrow=n.points, ncol=2)
colnames(mat) <- c("x", "y")
df <- data.frame(mat)

require(robustbase)
my_poly <- function(data, a, ...){
  cov_rob = covMcd(data, alpha = a)
  df_rob = data[cov_rob$best,]
  ch = chull(df_rob$x, df_rob$y)
  geom_polygon(data = df_rob[ch,], aes(x,y), ...)
}

require(ggplot2)
ggplot() + 
  geom_point(data=df, aes(x,y)) +
  my_poly(df, a = 0.5, fill=2, alpha=0.5) +
  my_poly(df, a = 0.7, fill=3, alpha=0.5) 

This leads to:

enter image description here

by controlling the alpha-value of covMcd you can increase/decrease the size of the area. See ?robustbase::covMcd for details. Btw.: Mcd stands for Minimum Covariance Determinant. Instead of it you can also use MASS::cov.mve to calculate the minimum valume ellipsoid with MASS::cov.mve(..., quantile.used=-percent of points within the ellipsoid.

For 2+ classes:

my_poly2 <- function(data, a){
  cov_rob = covMcd(data, alpha = a)
  df_rob = data[cov_rob$best,]
  ch = chull(df_rob[,1], df_rob[,2])
  df_rob[ch,]
}

ggplot(faithful, aes(waiting, eruptions, color = eruptions > 3)) +
  geom_point() + 
  geom_polygon(data = my_poly2(faithful[faithful$eruptions > 3,], a=0.5), aes(waiting, eruptions), fill = 2, alpha = 0.5) +
  geom_polygon(data = my_poly2(faithful[faithful$eruptions < 3,], a=0.5), aes(waiting, eruptions), fill = 3, alpha = 0.5)

enter image description here

Or if you are ok with un-robust ellipsoids have a look at stat_ellipse

Rentrop
  • 20,979
  • 10
  • 72
  • 100
  • nice idea, but what if I have 2+ areas? – groebsgr Jan 27 '16 at 12:19
  • that one seems quite good, I will try in the next time, at the moment there's too much stuff todo, but I will probably give a final implementation how I did this. Thank you! – groebsgr Jan 27 '16 at 14:16