5

Is there a way to make the density() function in R use counts vs. probability?

For example, I have two options when examining density distributions using the histogram function hist:

hist(x,freq=F)  #"graphic is a representation of frequencies, the counts component of the result"
hist(x,freq=T)  #"probability densities, component density, are plotted (so that the histogram has a total area of one)"

I'm wondering if there is a way to do something similar using the density function?

In my specific example, I have counts of trees with varying diameters. (I'll note that I kept my data as a continuous scale of sizes vs. lumping them into discrete size classes). When I use the density funciton with this data (i.e., plot(density(dat$D,na.rm=T,from=0))) it gives me a density estimation of probabilities for each size (of course smoothed out). I'm more interested in reporting this data as stems/area vs. probability, so I'd prefer the density estimates to use counts.

Thoughts??


UPDATE:

Here are some real example data:

 dat <- c(6.6, 7.1, 8.4, 27.4, 11.9, 18.8, 8.9, 25.4, 8.9, 8.6, 11.4, 19.3, 7.6, 42.2, 20.8, 25.1, 38.1, 42.2, 5.2, 34.3, 42.7, 34, 37.3, 45.5, 39.4, 25.1, 30.7, 23.1, 43.4, 19.6, 30.5, 23.9, 10.7, 18.3, 30, 35.8, 8.1, 11.9, 28.4, 30.5, 34.3, 10.4, 45, 38.9, 8.9, 11.7, 9.7, 7.4, 3.8, 20.6, 48.8, 6.6, 40.4, 13, 16, 8.6, 16, 13, 12.2, 11.4, 10.2, 22.6, 17.3, 12.4, 9.7, 17.3, 10.9, 27.2, 9.1, 13, 10.9, 15, 10.4, 27.2, 21.6, 18.8, 12.7, 15.5, 17, 16.3, 18, 26.9, 10.2, 21.3, 19, 11.7, 10.7, 18, 9.9, 16.5, 19.6, 22.1, 9.9, 18.3, 17, 6.9, 7.6, 12.7, 13.2, 9.7, 13.5, 18.3, 19.3, 30, 20.1, 18.5, 12.2, 16, 17, 14.2, 5.6, 12.2, 7.6, 17, 14, 16.5, 13.7, 11.9, 14.2, 15, 13.7, 13.2, 9.1, 6.9, 9.9, 11.4, 12.7, 10.2, 12.4, 15, 20.1, 6.9, 8.1, 11.4, 10.7, 10.9, 18.3, 9.1, 6.3, 17.3, 20.1, 9.4, 7.1, 16, 15, 10.9, 14.7, 18.8, 14.5, 10.7, 14, 10.4, 14.5, 15.7, 10.9, 14.7, 19.3, 12.4, 7.1, 14, 15.5, 36.8, 23.1, 7.9, 9.9, 8.1, 14.7, 13.7, 18, 10.7, 11.9, 12.7, 12.4, 17.8, 7.9, 12.2, 10.4, 13, 14.7, 12.7, 8.1, 14.2, 10.2, 11.9, 5.6, 8.4, 6.1, 7.6, 7.9, 19.8, 7.4, 12.7, 10.2, 12.4, 10.4, 12.4, 26.9, 12.7, 16.8, 22.9, 15.7, 10.4, 13.7, 8.1, 13.7, 14.2, 21.6, 20.8, 12.4, 10.9, 10.2, 29.5, 19.3, 8.9, 6.1, 11.2, 7.1, 28.7, 15.7, 10.4, 8.6, 10.4, 9.1, 14.5, 25.7, 11.4, 15.5, 8.1, 13.2, 16.8, 5.8, 20.8, 10.2, 9.1, 5.6, 14.5, 14.5, 17.5, 29.2, 13, 14, 12.4, 9.9, 21.1, 18.8, 14, 15.5, 9.7, 24.1, 20.1, 20.3, 12.4, 15.2, 15.7, 8.6, 8.6, 10.4, 12.4, 16.8, 4.1, 8.1, 6.6, 11.7, 7.9, 17.5, 9.1, 4.6, 7.1, 7.6, 9.4, 20.8, 11.4, 15.5, 7.1, 18.5, 7.9, 16.5, 6.3, 6.1, 16.5, 15.5, 17.3, 20.3, 12.7, 20.3, 13.7, 8.4, 16.8, 14, 18, 10.9, 19.8, 10.7, 27.2, 11.4, 7.9, 11.2, 14.5, 14.2, 11.2, 13.5, 18.5, 4.3, 7.9, 6.1, 9.9, 14.7, 8.4, 14, 12.4, 15, 14.2, 11.4, 7.6, 12.7, 5.8, 16, 7.9, 3.3, 5.8, 4.8, 4.8, 7.4, 9.1, 8.4, 3.8, 9.1, 9.4, 8.4, 9.9, 7.9, 13.2, 20.8, 18.3, 16.8, 13.5, 12.4, 8.1, 6.3, 7.6, 18.5, 14, 10.2, 9.4, 11.9, 11.4, 13, 14.5, 17, 7.9, 10.2, 7.4, 5.3, 6.9, 17.8, 5.6, 10.9, 9.9, 9.9, 16.5, 8.9, 24.1, 22.9, 13.5, 10.7, 23.4, 10.9, 28.2, 5.6, 19.6, 15.2, 6.3, 23.1, 19.3, 26.7, 30.5, 13.7, 7.9, 20.8, 19.8, 21.6, 21.6, 9.9, 30.5, 16.3, 11.9, 5.1, 15.2, 13.2, 7.1, 5.8, 9.9, 19.3, 15.5, 25.7, 14, 29.7, 11.9, 12.7, 25.9, 16.3, 25.9, 6.1, 26.7, 7.9, 9.7, 22.1, 20.1, 24.4, 17.3, 13.2, 16.5, 16.8, 21.8, 15.2, 9.9, 19.6, 23.6, 23.4, 17.8, 15.5, 11.4, 20.8, 22.1, 26.4, 12.4, 14.2, 6.9, 22.1, 22.6, 34.5, 15, 13.2, 19.6, 18.3, 15.5, 13.5, 14, 19.8, 21.1, 16.3, 19.8, 13.7, 12.2, 11.7, 31.7, 12.7, 13.2, 7.6, 12.2, 13.2, 31.7, 9.9, 10.2, 9.1, 9.1, 21.6, 8.6, 12.7, 13.5, 9.7, 8.9, 11.7, 8.4, 19.6, 7.6, 13.2, 18.3, 11.2, 22.4, 10.9, 14.7, 12.7, 16.8, 18.8, 15, 8.1, 20.8, 22.1, 7.6, 16.3, 10.9, 8.9, 11.7, 24.4, 29, 29.2, 27.4, 25.1, 6.6, 11.7, 16.5)

Here is attempting to try the method that @eipi10 suggests:

#Produce graph showing counts of values using table():
  plot(x=names(table(dat)), y = table(dat),type='l')
#Produce graph showing counts of values using density + @eipi10's method
  dens <- density(x = dat, na.rm = T, bw = 0.1, n = length(dat))
  dens$y <- length(dat)/sum(dens$y) * dens$y  #"fix" to counts
  plot(dens)

This code creates the following 2 graphs [titled post-hoc]:

enter image description here

As you can see, the two approaches come up with different values on the y axis. In other words, @eipi10's approach is not working for me :(.

theforestecologist
  • 4,667
  • 5
  • 54
  • 91
  • if break them into discrete classes, then all you have to do is integrate the curve for each class. – Rorschach Sep 11 '15 at 23:52
  • This is a very close match: `plot(x=as.numeric(names(table(dat))), y = table(dat),type='l'); lines(dens$x,dens$y*sum(dens$y)/diff(dens$x)[1],col=2)` – Ben Bolker Jul 26 '17 at 02:25
  • @BenBolker Thanks for the alternate approach. However, I can't get it to perform consistently with other data sets. Anyway, ultimately, I don't just want the two graphs to look similar. I want to *actually* be able to convert the probability values that `density()` outputs into *actual* counts. Is this at all possible? – theforestecologist Jul 26 '17 at 03:32

2 Answers2

7

You can convert to counts by normalizing the density values to the number of values in your sample. For example:

# Fake data
k=1000
set.seed(104)
val = rnorm(k)
dens = density(val, n=512)

# Convert to counts
dens$y = k/sum(dens$y) * dens$y

plot(dens)

But remember that the counts you end up with depend on how finely you divide the x-axis (which depends on the n argument to density). You can determine delta-x with mean(diff(dens$x)) (the intervals don't really vary, but they're not all exactly the same due to rounding error).

UPDATE: In light of your comment, the code below should explain what's going on. But first, note that the counts you get when binning your actual data will not (in general) match the counts derived from the kernel density estimate unless the binning intervals for the actual data are the same as those used for the kernel density estimate. (The counts are unlikely to match exactly in any case, due to the smoothing in the kernel density estimate, but the binning intervals need to be the same in order to get a close correspondence.)

library(ggplot2)
library(reshape2)
library(dplyr)

# Fake data
k=1000
set.seed(104)
dat = data.frame(diameter = rnorm(k,100,10))

Create 3 kernel density estimates: First two use 20 and 100 points, respectively. The third uses 100 points, but with 1/10th the default bandwidth.

# Convert density to counts
ctc = function(data, nPoints, numValues, adj=1) {
  dens = density(data$diameter, n=nPoints, adjust=adj)
  dens$y = numValues/sum(dens$y) * dens$y
  return(dens)
}

dens20 = ctc(dat, 20, k)
dens100 = ctc(dat, 100, k)
dens100adj = ctc(dat, 100, k, 0.1)

Create data frames with the actual counts and counts estimated from the kernel density estimate. We'll use the cut function to ensure that the actual counts use the same intervals as the kernel density estimates.

dd = function(data, dens) {
  data = data.frame(table(cut(data$diameter, 
                              breaks=c(dens$x - 0.5*mean(diff(dens$x)),Inf))),
                    DensityCounts=round(dens$y,1))  # Rounding is just for easier comparison by eye if you display the data frame
  names(data)[1:2] = c("DiameterRange","ActualCounts")
  return(data)
}

dat20 = dd(dat, dens20)
dat100 = dd(dat, dens100)
dat100adj = dd(dat, dens100adj)

Now we create plots comparing each of the kernel density estimates with the actual counts. Notice when the actual counts match the counts created from the density estimates, and how that's affected by both the bandwidth and how fine an interval we use.

pf = function(data, title) {
  ggplot(data %>% melt(id.var="DiameterRange"), 
         aes(DiameterRange, value, colour=variable, group=variable)) +
    geom_line() +
    theme(axis.text.x=element_text(angle=-90, vjust=0.5, hjust=0)) +
    ggtitle(title)
}

gridExtra::grid.arrange(pf(dat20, "n=20"), 
                        pf(dat100, "n=100"), 
                        pf(dat100adj, "n=100; 1/10th default bandwidth"))

enter image description here

eipi10
  • 91,525
  • 24
  • 209
  • 285
  • this does not seem to be working. When I have a count of 7, my dens$y for that x-value is 0.2865707498, but when I use your equation the result is 2.544275344. I would expect the result to be 7. Am I just approaching this incorrectly?? – theforestecologist Sep 13 '15 at 02:11
  • Maybe it would help if you posted a sample of your data. When you say you have a "count of 7", within what interval (range of diameters in your case)? Is the interval the same as the interval in your density estimate? If not, the counts can be different. Actually, the counts will be somewhat different in any case, due to the smoothing of kernel density estimate. – eipi10 Sep 13 '15 at 15:50
  • Remember, the density estimate is not the probability of an X-value (where X is the diameter in this case) being exactly equal to some value (in this case 0.2865). It's the probability of it falling between Xi and Xi+1, where that range is (in the case of a kernel density estimate) `mean(diff(dens$x))`. When we convert to counts, we get the probability of a given number of counts falling within Xi and Xi+1 (given the number of data points in your sample and their particular values). – eipi10 Sep 13 '15 at 15:50
  • (We haven't talked about the bandwidth, but that also affects the result. R determines that by a default method, but you can change it.) – eipi10 Sep 13 '15 at 15:50
  • Hey @eipi10, I've finally decided to revisit this. I've made updates to my question including providing example data and a demonstration of how I can;t get your method to work with my data. Any additional help you could provide would be much appreciated! – theforestecologist Jul 26 '17 at 02:16
0

Unless you specificaly need the density function, maybe you can use the function table

Counts<-table(factor(dat$D,levels=0:n)) # n=number of size levels

plot(Counts,type="l")

This will give you the counts for each diameter, but it not smoothed.

tharindu_DG
  • 8,900
  • 6
  • 52
  • 64
Marianela
  • 35
  • 6