I am trying to produce a 2-d density plot overlaid on a scatterplot in ggplot2.

I have the following working code:

plt<-ggplot(data=for_plot,aes(x=X, y=Y))+ 
  stat_density2d(aes(fill=..level..,alpha=..level..),geom='polygon',colour='black') + 
  scale_fill_continuous(low="green",high="red") +
  guides(alpha="none") +
  ylim(0.5,max(shortest_path_list$shortest_path)) +
  geom_point()

When I run the code with this dataset:

> for_plot[sample(nrow(for_plot), 20), ]
    Y   X
 1: 2 110182.549
 2: 3  95202.283
 3: 2  91557.371
 4: 1   6730.598
 5: 1   7396.081
 6: 1  13939.701
 7: 2   9767.561
 8: 3 101597.449
 9: 2  99368.467
10: 3 102024.722
11: 3  90491.076
12: 3  81337.624
13: 1   5956.710
14: 3  95160.149
15: 3  89981.055
16: 1   8823.615
17: 1  10717.879
18: 2  11463.036
19: 2   3864.292
20: 2  10351.874

It works fine and gives me the following output: [image]

Note that my Y is discrete and X is continuous, so the plot is fine.

However, when I use this dataset:

> for_plot[sample(nrow(for_plot), 20), ]
    Y   X
 1: 1   9897.476
 2: 2   2350.191
 3: 1  13911.780
 4: 1  98885.336
 5: 1  94776.873
 6: 1 102804.832
 7: 1  99956.988
 8: 1  13941.653
 9: 1   9246.795
10: 1  13152.775
11: 1 113325.680
12: 1  82263.657
13: 1  91108.347
14: 1   8823.797
15: 1  11057.255
16: 1  99150.825
17: 2   7312.730
18: 2   6476.152
19: 1 113534.588
20: 1  91311.834 

I get the following error and the plot:

Warning message:
Computation failed in `stat_density2d()`:
bandwidths must be strictly positive

[image]

I know this error usually occurs when there is no variance in either the X or the Y direction. But in this case there seems to be variation similar to the first case, so I do not understand why the first scenario works and the second fails. Is there a workaround to get the contours in the second scenario?

Here are two scenarios with minimal reproducible examples, as suggested by MrFlick:

Scenario 1 (the plot works):

set.seed(100)
> for_plot<-dput(for_plot[sample(nrow(for_plot), 20), ])
structure(list(Y = c(2, 2, 3, 1, 2, 
3, 3, 3, 2, 1, 3, 2, 2, 3, 1, 3, 2, 3, 2, 1), X = c(96649.7975713206, 
104758.02495167, 93351.5907987183, 5535.8146932624, 99480.6016841293, 
113103.505637801, 90445.3465777551, 81903.811792781, 106832.148472597, 
6576.45291001145, 99368.9134426028, 111130.390217174, 9471.82883910966, 
102087.415882298, 5657.05900168211, 107688.549964059, 103669.855375872, 
94121.8586312176, 1573.00051813297, 7394.05750749363)), .Names = c("Y", "X"), class = c("data.table", 
"data.frame"), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x00000000065c0788>)

[image]

Scenario 2 (the plot does not produce the desired output):

> for_plot<-dput(for_plot[sample(nrow(for_plot), 20), ])
structure(list(Y = c(1, 
    1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2), 
    X = c(96925.0119740431, 98869.1560687514, 99434.7995468473, 
    9123.65901167288, 111471.920587976, 109448.280478224, 6678.04323546572, 
    98309.4525934759, 91311.834287723, 86616.727265815, 101009.644050382, 
    7396.08053430818, 102517.086739334, 11504.3148787722, 9471.82883910966, 
    15427.4786153589, 96385.4989659007, 2249.38197350042, 91425.5491534976, 
    9303.7114788096)), .Names = c("Y", 
"X"), class = c("data.table", "data.frame"), row.names = c(NA, 
-20L), .internal.selfref = <pointer: 0x00000000065c0788>)

The error:

 Warning message:
    Computation failed in `stat_density2d()`:
    bandwidths must be strictly positive

[image]
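One check that may help narrow this down, as a sketch only: the ggplot2 documentation for the `h` argument says the bandwidth is estimated with `MASS::bandwidth.nrd()` when none is given, and that rule uses the smaller of the standard deviation and the scaled interquartile range. Assuming that is the code path hit here, the two Y vectors from the dput() outputs can be compared directly:

```r
# Y values copied from the two dput() outputs above
y1 <- c(2, 2, 3, 1, 2, 3, 3, 3, 2, 1, 3, 2, 2, 3, 1, 3, 2, 3, 2, 1)  # scenario 1
y2 <- c(1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2)  # scenario 2

# Both vectors have non-zero variance ...
var(y1)
var(y2)

# ... but the interquartile range differs
IQR(y1)  # 1
IQR(y2)  # 0, because 16 of the 20 values are equal to 1

# Normal-reference bandwidth from MASS, based on min(sd, IQR / 1.34)
bw1 <- MASS::bandwidth.nrd(y1)
bw2 <- MASS::bandwidth.nrd(y2)
c(scenario1 = unname(bw1), scenario2 = unname(bw2))
```

If the second bandwidth comes out as zero, that would be consistent with the "bandwidths must be strictly positive" message even though Y is not constant.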

Update

One way of getting the kernels to work is to add some random noise to the Y variable so that the estimated bandwidth is no longer 0.

#Add variability for kernel density
rand_noise<-runif(nrow(for_plot), -0.1, 0.1)
for_plot$Y_noise<-for_plot$Y+rand_noise

Though the error goes away and the kernels are produced, they are not as nice and uniform as in scenario 1: [image]
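A possible alternative to jittering, sketched under the assumption that the failure comes from the default bandwidth estimate for Y being zero: `stat_density2d()` accepts an `h` argument (a bandwidth vector of length two), so the Y bandwidth can be supplied by hand. The sd-only rule used for `h_y` below is an illustrative choice, not something from my original code, and the fill/alpha mappings and `ylim()` are left out to keep the example short.

```r
library(ggplot2)

# Scenario 2 data, copied from the dput() output above (as a plain data.frame)
for_plot <- data.frame(
  Y = c(1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2),
  X = c(96925.0119740431, 98869.1560687514, 99434.7995468473,
        9123.65901167288, 111471.920587976, 109448.280478224, 6678.04323546572,
        98309.4525934759, 91311.834287723, 86616.727265815, 101009.644050382,
        7396.08053430818, 102517.086739334, 11504.3148787722, 9471.82883910966,
        15427.4786153589, 96385.4989659007, 2249.38197350042, 91425.5491534976,
        9303.7114788096)
)

# Assumed bandwidth choice: keep the MASS default for X, and use an
# sd-only normal-reference rule for Y so a zero IQR cannot zero it out
h_x <- MASS::bandwidth.nrd(for_plot$X)
h_y <- 4 * 1.06 * sd(for_plot$Y) * nrow(for_plot)^(-1/5)
h <- c(h_x, h_y)

# Pass the bandwidths explicitly instead of relying on the default estimate
plt <- ggplot(for_plot, aes(x = X, y = Y)) +
  stat_density2d(h = h, colour = "black") +
  geom_point()
```

Printing `plt` draws the contours. Whether they look as smooth as in scenario 1 depends on the chosen `h_y`, so the value may need tuning.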

As I mentioned in the comments, what really baffles me is why scenario 1 always works by default and scenario 2 never does. I have tried different subsets of the data to verify this. The nature of the data is the same in both scenarios.

DotPi
  • It would help if you included a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data so we can run the code to see what's going on. Try to use `set.seed()` if you are using random functions like `sample` so we can see the same thing you are. – MrFlick Oct 10 '16 at 22:13
  • @MrFlick: There are 4290 rows in each of the two tables I am using. Is there a way I can include that in a question? – DotPi Oct 10 '16 at 22:25
  • Are all 4290 required for reproducing the error? Seems unlikely. Just create a __minimal__ reproducible example. See the suggestions included in the link I provided. – MrFlick Oct 10 '16 at 22:26
  • I may be missing something, but are you sure it is correct to make a 2d density plot when one of the variables is discrete? I'd rather just use the alpha channel to highlight point density, maybe with some jitter. Or maybe a hex bin with low bin width.... – lbusett Oct 10 '16 at 22:26
  • @MrFlick: Ok. Let me try and create a minimal reproducible example. – DotPi Oct 10 '16 at 22:27
  • Also, there is always the option to upload the data somewhere and provide a link. – jakub Oct 10 '16 at 22:57
  • `geom_quasirandom` is my new and preferred jitter – Nate Oct 11 '16 at 02:57
  • @LorenzoBusetto: My boss likes the output of stat_density2d more than the ones produced by jitter or playing with the alpha channel. But, I am personally perplexed as to why there is a difference between the 2 scenarios. – DotPi Oct 12 '16 at 00:20

0 Answers