I need to plot a huge dataset (1 million data points) against two variables. I want:

  • An equivalent of geom_point to see the distribution of my data
  • A geom_smooth to see the global trend

My data is very concentrated in some areas of my graphs. However, since I have a lot of data, the geom_smooth should be valid across most areas of my graph (but not all).

I can use geom_point() for that, but it takes a long time to plot and can lead to misinterpretation, since the graph needs to be zoomed in to see the real position of the points.

set.seed(1)
library(data.table)
library(ggplot2)

d=data.table(a=c(sample(seq(1,1500,1),20000, replace=T),sample(seq(1998,2000,1),1000, replace=T),sample(seq(1,150,1),19000, replace=T)),
             b=c(sample(seq(1,2000,1),20000, replace=T),sample(seq(150,160,1),1000, replace=T),sample(seq(1100,1600,1),19000, replace=T)))

ggplot(d) + aes(x=a,y=b)+
  geom_point(shape = 1,alpha=0.2) +
  geom_smooth(col="black")

Here we have a zoomed-out plot: it gives the impression that the density in the left part of the graph is rather homogeneous

small plot

In reality, though, the density varies within this area:

Zoomed plot

To address this, and to address the speed problem of geom_point, I found the geom_hex() function.

ggplot(d) + aes(x=a,y=b)+
  geom_hex(bins=70,col="white") +
  geom_smooth(col="black")

First geom hex

Here, we can see that the hexagons to the right are very dense in terms of data, but we barely see that the left part is also denser than the rest of the graph.

To address this problem, I set another scale_fill_gradient(), as suggested here. I set the gradient's limits to 0-150, on the basis that a hexagon with more than 150 observations should be considered dense.

ggplot(d) + aes(x=a,y=b)+
  geom_hex(bins=70,col="white") + 
  scale_fill_gradient(low="yellow", high="coral2",limits=c(0,150)) +
  geom_smooth(col="black")

The problem is that the hexagons that exceed 150 observations are blue, but I need them to be red so that the graph can be interpreted. I still want some nuance in my graph, and I want to keep a gradient for the hexagons with <150 observations (I don't want just two colors).

second geom hex with fixed colors

Can someone help me with that ?

PS: I used ggthemr::ggthemr("pale") to get prettier graphs, so it is normal if the formatting isn't the same for you.

PPS: this is dummy data. Obviously my real data isn't as boring and homogeneous as this (and the distribution of points is more complicated); I just did what I could to recreate the problem.

Dimitri

2 Answers

If you want a bit more differentiation between lower values on the scale, you can use scale_fill_gradientn and play around with the colours and values arguments to get a result that works well with your data:

ggplot(d) + 
  aes(x = a, y = b) +
  geom_hex(bins = 70, colour = "white") + 
  scale_fill_gradientn(colours = c("white", 'yellow', 
                                   'gold', 'coral2', 'red2'), 
                       values = c(0, 0.01, 0.1, 0.9, 1)) +
  geom_smooth(colour = "black") +
  theme_minimal()
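
If a hard cap at 150 is what is wanted, as in the question, a possible alternative is the `oob` argument of the scale. By default, counts outside `limits` are treated as missing and drawn in the scale's `na.value` colour; passing `scales::squish` clamps them to the nearest limit instead, so every hexagon above 150 gets the `high` colour while the gradient below 150 is kept. This is only a sketch on the dummy data from the question, rebuilt here as a plain data.frame so that the block is self-contained:

```r
library(ggplot2)  # geom_hex() also needs the hexbin package to be installed

# Dummy data from the question (data.frame instead of data.table)
set.seed(1)
d <- data.frame(
  a = c(sample(1:1500, 20000, replace = TRUE),
        sample(1998:2000, 1000, replace = TRUE),
        sample(1:150, 19000, replace = TRUE)),
  b = c(sample(1:2000, 20000, replace = TRUE),
        sample(150:160, 1000, replace = TRUE),
        sample(1100:1600, 19000, replace = TRUE))
)

# What squish does to raw counts: values above the upper limit become the limit
squished <- scales::squish(c(10, 150, 900), range = c(0, 150))
squished
# 10 150 150

p_capped <- ggplot(d, aes(x = a, y = b)) +
  geom_hex(bins = 70, colour = "white") +
  scale_fill_gradient(
    low = "yellow", high = "coral2",
    limits = c(0, 150),
    oob = scales::squish  # clamp counts above 150 instead of turning them into NA
  ) +
  geom_smooth(colour = "black")

p_capped
```

With this, the legend still runs from 0 to 150, so the top colour should be read as "150 or more".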


Allan Cameron
  • It doesn't answer my question. Like I said, this is dummy data and the densities in my real dataset are much more heterogeneous than here. You could consider the gray hexagon on the right as the zones where my data is very concentrated, the left area as the area where my data is less concentrated but still concentrated enough to be considered trustworthy, and the rest as not concentrated enough to be trusted. Here, the hexagon on the right is still gray (and so will be the most concentrated areas in my graph). – Dimitri Apr 27 '23 at 12:50
  • @Dimitri have you tried using `scale_fill_gradientn`? I will add an example of how that might work for you. – Allan Cameron Apr 27 '23 at 12:53
  • Okay, thanks, I didn't know the function. If I understand correctly, `colours = c("white", 'yellow', "coral2"), values = c(0, 0.05, 1)` means that white is associated with the minimum density, yellow with a density of 0.05*200=10 observations, and coral2 with the maximum density? – Dimitri Apr 27 '23 at 12:58
  • Related: [Is it possible to define the "mid" range in scale_fill_gradient2()?](https://stackoverflow.com/a/21758729/1851712); [Increase resolution of color scale for values close to zero](https://stackoverflow.com/a/20584038/1851712) – Henrik Apr 27 '23 at 13:01
  • @Dimitri yes, in `values`, 0 is the minimum density and 1 is maximum density. The `colours` argument maps each point in `values` to a particular colour. I have shown an example of how to map your sample data so that low density is white/yellow, medium density is coral, and high density is red. – Allan Cameron Apr 27 '23 at 13:02
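
The arithmetic discussed in these comments can be sketched with `scales::rescale()`, which converts counts into the 0-1 positions that `values` expects. The maximum of 200 is taken from the comment above; the count breaks and the colour choice are hypothetical. Because `values` is relative to the scale's limits, fixing `limits` explicitly keeps the mapping predictable:

```r
library(ggplot2)
library(scales)

max_count <- 200                          # assumed maximum bin count, as in the comment
count_breaks <- c(0, 10, 150, max_count)  # hypothetical counts where the colour should change

# Convert counts to positions on the 0-1 scale used by `values`
values <- rescale(count_breaks, to = c(0, 1), from = c(0, max_count))
values
# 0.00 0.05 0.75 1.00

# One colour per position; add this scale to the geom_hex() plot
fill_scale <- scale_fill_gradientn(
  colours = c("white", "yellow", "coral2", "red2"),
  values  = values,
  limits  = c(0, max_count)
)
```

So 10 observations land at 0.05 and 150 observations at 0.75 of the way along the gradient.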

There’s a package {ggpointdensity} that could be helpful to you. geom_pointdensity() colors each point according to its number of neighbors with the calculated stat n_neighbors which you could transform as appropriate (maybe log10?)

library(ggpointdensity)

ggplot(d) + aes(x=a,y=b)+
  geom_pointdensity() +
  geom_smooth(col="black")+
  scale_color_viridis_c(trans = "log10")


JoFrhwld
  • Hello, thanks for this solution, it seems to work really well on the dummy dataset. However, speed is a real concern for me (I simplified my problem: I don't have 1 but 6 variables to plot against ~30 explanatory variables), and this is really slow on 1 million observations. – Dimitri Apr 27 '23 at 13:32
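
Since speed is the concern, one possible middle ground is to keep geom_hex() and borrow only the log10 idea, applying it to the bin counts rather than to per-point neighbour counts. This is a sketch on the dummy data (rebuilt as a plain data.frame), not something tested on a million rows. Every drawn hexagon holds at least one point, so the log transform is always defined:

```r
library(ggplot2)  # geom_hex() also needs the hexbin package to be installed

# Dummy data from the question (data.frame instead of data.table)
set.seed(1)
d <- data.frame(
  a = c(sample(1:1500, 20000, replace = TRUE),
        sample(1998:2000, 1000, replace = TRUE),
        sample(1:150, 19000, replace = TRUE)),
  b = c(sample(1:2000, 20000, replace = TRUE),
        sample(150:160, 1000, replace = TRUE),
        sample(1100:1600, 19000, replace = TRUE))
)

p_log <- ggplot(d, aes(x = a, y = b)) +
  geom_hex(bins = 70, colour = "white") +
  # log10 spreads the low counts over more of the colour range;
  # recent ggplot2 versions also accept transform = "log10"
  scale_fill_viridis_c(trans = "log10") +
  geom_smooth(colour = "black")

p_log
```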