
I have frequency values changing with time (x-axis units), as shown in the plot below. After some normalization, these values may be seen as data points of a density function for some distribution.

Q: Assuming that these frequency points come from a Weibull distribution T, how can I best fit a Weibull density function to the points so as to infer the parameters of T from it?

sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)

plot(1:length(sample), sample, type = "l")
points(1:length(sample), sample)

[Plot: frequency values over time]

Update. To avoid being misunderstood, I would like to add a little more explanation. By saying I have frequency values changing with time (x-axis units), I mean I have data which says that I have:

  • 7787 realizations of value 1
  • 3056 realizations of value 2
  • 2359 realizations of value 3 ... etc.

One way toward my goal (an incorrect one, I think) would be to create the set of these realizations:

# Loop to expand each value into its observed number of realizations
set.values <- c()
for(i in 1:length(sample)){
  set.values <- c(set.values, rep(i, times = sample[i]))
}

hist(set.values)
lines(1:length(sample), sample)
points(1:length(sample), sample)
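For what it's worth, the loop above can also be written as a single vectorized call (same result, no explicit loop):

```r
# Frequencies from the question
sample <- c(7787, 3056, 2359, 1759, 1819, 1189, 1077, 1080, 985, 622, 648, 518,
            611, 1037, 727, 489, 432, 371, 1125, 69, 595, 624)

# rep() expands each value i into sample[i] copies of i
set.values <- rep(seq_along(sample), times = sample)
```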

[Plot: histogram of set.values with the frequency curve overlaid]

and use fitdistr on set.values:

library(MASS)  # for fitdistr
f2 <- fitdistr(set.values, 'weibull')
f2

Why do I think this is an incorrect approach, and why am I looking for a better solution in R?

  • in the distribution-fitting approach presented above, it is assumed that set.values is a complete set of my realisations from the distribution T

  • in my original question I know only the points from the first part of the density curve - I do not know its tail, and I want to estimate the tail (and the whole density function)
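To make the goal concrete: since the data are (x, frequency) pairs rather than raw realizations, one direct option is least squares on a scaled Weibull density, i.e. fitting A * dweibull(x, shape, scale) to the frequency points. This is only a sketch of my own; the starting values and box constraints below are guesses, not part of the question:

```r
# Frequencies from the question; x is the time index
sample <- c(7787, 3056, 2359, 1759, 1819, 1189, 1077, 1080, 985, 622, 648, 518,
            611, 1037, 727, 489, 432, 371, 1125, 69, 595, 624)
x <- seq_along(sample)

# Sum of squared errors between the observed frequencies and a
# Weibull density scaled by an amplitude A; p = c(A, shape, scale)
sse <- function(p) {
  sum((sample - p[1] * dweibull(x, shape = p[2], scale = p[3]))^2)
}

# Starting values and bounds are rough guesses (A near the total count)
start <- c(A = sum(sample), shape = 0.8, scale = 5)
fit <- optim(start, sse, method = "L-BFGS-B",
             lower = c(1, 0.1, 0.1))
fit$par   # fitted amplitude, shape, and scale
```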

Marta Karas
  • I have updated my answer with histograms. – TinaW May 03 '15 at 13:42
  • Do you know the exact value where the first part of the density curve ends and the tail begins? Your sample ends at value 22: can I assume that the tail begins at 23? – renato vitolo May 05 '15 at 08:17
  • I am afraid I do not understand (I am not aware of a formal definition of "distribution tail" I could use here). My eventual goal is to compute the expected value of the variable which has distribution `T`. Maybe it is reasonable to assume that the first part (the part between the 1st and 2nd points in the histogram above) is linear and the latter part Weibull. (Weibull is an assumption I was given by the person who provided me with the data. I wouldn't bet my life on it, but I am inclined to assume the same.) – Marta Karas May 05 '15 at 09:48
  • You say: "in my original question I know the points from the first part of the density curve". What do you mean exactly by "first part"? At what value does the "first part" stop? You also say: "I do not know its tail and I want to estimate the tail (and the whole density function)". For that you need (a criterion) to select where the tail begins. – renato vitolo May 05 '15 at 21:15
  • I kind of think I have answered it. In what way is my solution not what you are looking for? – Mike Wise May 09 '15 at 22:23
  • @MikeWise , I agree with your opinion. I am very glad you got involved in this discussion - you not only provided a "working" solution, but also contributed by sharing your thoughts multiple times. Thank you a lot! :) – Marta Karas May 10 '15 at 07:18
  • And thank you. Was actually the most fun question I have ever answered. I am also involved in a Predictive Maintenance project, if you want to share thoughts outside of SO, ping me. – Mike Wise May 10 '15 at 07:57

3 Answers


[Plot 1: first try, with all points]

[Plot 2: second try, with the first points dropped]

Here is a better attempt. Like before, it uses optim to find the best value constrained to a box (defined by the lower and upper vectors in the optim call). Notice that it scales x and y as part of the optimization, in addition to the Weibull shape parameter, so we have three parameters to optimize over.

Unfortunately, when all the points are used, it nearly always ends up on an edge of the constraining box, which suggests to me that a Weibull may not be a good fit for all of the data. The problem is the first two points - they are just too large. You can see the attempted fit to all the data in the first plot.

If I drop those first two points and fit just the rest, we get a much better fit. You can see this in the second plot. I think this is a good fit; it is in any case a local minimum in the interior of the constraining box.

sample <- c(60953,7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
t.sample <- 0:22

s.fit <- sample[3:23]
t.fit <- t.sample[3:23]

wx <- function(param) { 
  res <- param[2]*dweibull(t.fit*param[3],shape=param[1])
  return(res)
} 
minwx <- function(param){
  v <- s.fit-wx(param)
  sqrt(sum(v*v))
}

p0 <- c(1,200,1/20)
paramopt <- optim(p0, minwx, gr = NULL, method = "L-BFGS-B",
                  lower = c(0.1, 100, 0.01), upper = c(1.1, 5000, 1))

popt <- paramopt$par
popt
rms <- paramopt$value
tit <- sprintf("Weibull - Shape:%.3f yscale:%.1f xscale:%.5f rms:%.1f", popt[1], popt[2], popt[3], rms)

plot(t.sample[2:23], sample[2:23], type = "p",col="darkred")
lines(t.fit, wx(popt),col="blue")
title(main=tit)
Mike Wise
  • Hi @Mike Wise, thank you for your interest and this full example! As you can see, it is hard to fit the curve this way - in my opinion the fitted curve does not fit well, as it is not "bent" enough. I believe it should look far more like the blue curve from [here](http://upload.wikimedia.org/wikipedia/commons/thumb/5/58/Weibull_PDF.svg/325px-Weibull_PDF.svg.png), shouldn't it? – Marta Karas May 03 '15 at 10:41
  • +1 for the full example, which may have more general uses – Marta Karas May 03 '15 at 10:48
  • Yes, I think you have to use external knowledge to constrain the fit - for example, prior knowledge of whether the failure rate is constant, increasing, or decreasing. That is venturing into the realm of Bayesian estimation, I believe, and is beyond me at this point, but I am working my way there :) – Mike Wise May 03 '15 at 10:51
  • There are some interesting discussions of fitting Weibull using other tools on this forum. But I will need time to digest them. – Mike Wise May 03 '15 at 10:52
  • Thank you, @Mike Wise! I would be really grateful if you could mention here any interesting references if you find something. I really appreciate your help. – Marta Karas May 03 '15 at 10:54
  • Took another shot at it. – Mike Wise May 03 '15 at 12:37
  • Wow, I have just realized that *I think only the tail is Weibull* may be a very good point! Thank you. I will investigate it and your solution further within a few days. – Marta Karas May 03 '15 at 17:52
  • @Mike Wise: any reason why you set t.sample <- 0:22? Using t.sample <- 1:23 and starting t.fit, s.fit from 1:23, I obtain a decent fit throughout, including the first point. – renato vitolo May 05 '15 at 21:55
  • Um, t.sample <- 0:22 was what the OP posted in the original problem, so I just used it and did not change it, as I considered it one of the givens. Now I see the OP has changed the problem around and things are a bit different: the t-axis scale can vary. I need to think about that a bit. – Mike Wise May 06 '15 at 04:11
  • I have some more ideas; I may get around to trying them tomorrow or this evening. – Mike Wise May 07 '15 at 07:14
  • Tried to fit two Weibulls at once to handle those first two points, but could not get convergence. – Mike Wise May 09 '15 at 11:19
  • You can get other good fits by changing the x and y scales around a bit. It would be helpful to know more about the time scale (what the origin was, etc.). If this were my project, I would probably do bootstrapping on these fits to get parameter bounds and distributions. – Mike Wise May 09 '15 at 11:25

You can directly calculate the maximum likelihood parameters, as described here.

# Defining the error of the implicit equation for the shape parameter k
k.diff <- function(k, vec){
  x2 <- seq(length(vec))
  abs(k^-1 + weighted.mean(log(x2), w = vec) -
        weighted.mean(log(x2), w = x2^k * vec))
}

# Find the k that drives the error to (nearly) zero, fulfilling the equation
k <- optimize(k.diff, vec = sample, interval = c(0.1, 5), tol = 10^-7)$min

# Calculate lambda (the scale parameter), given k
l <- weighted.mean(seq(length(sample))^k, w = sample)^(1/k)

# Plot
plot(density(rep(seq(length(sample)),sample)))
x <- 1:25
lines(x, dweibull(x, shape=k, scale= l))
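As a rough cross-check (my addition, not part of the original answer), these weighted estimates should land close to what fitdistr reports when the frequencies are expanded into individual realizations:

```r
library(MASS)  # for fitdistr

sample <- c(7787, 3056, 2359, 1759, 1819, 1189, 1077, 1080, 985, 622, 648, 518,
            611, 1037, 727, 489, 432, 371, 1125, 69, 595, 624)

# Expand the frequencies into raw realizations and fit directly
set.values <- rep(seq_along(sample), times = sample)
f <- fitdistr(set.values, "weibull")
f$estimate  # compare shape with k and scale with l from above
```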
user1965813
  • It does not work until after I run my code - no idea why. The error message is: Error in as.double(w) : cannot coerce type 'closure' to vector of type 'double' – Mike Wise May 09 '15 at 11:28
  • Hi @user1965813, thank you for your answer! I was able to reproduce your code. I also reproduced the code for the sample with the first element removed (as in the discussion there is an opinion that the first point does not "fit" the rest, and I am inclined toward these thoughts), [see here](http://paste.ubuntu.com/11057419/). Then I compared the shapes of [these density plots](http://i.imgur.com/LwROnSE.png) and it seems Mike's solution gives more appropriate results in this case. Nevertheless, thank you a lot for sharing this approach! – Marta Karas May 10 '15 at 07:16

Assuming the data are from a Weibull distribution, you can get estimates of the shape and scale parameters like this:

library(MASS)  # for fitdistr

sample <- c(7787,3056,2359,1759,1819,1189,1077,1080,985,622,648,518,
            611,1037,727,489,432,371,1125,69,595,624)
f <- fitdistr(sample, 'weibull')
f

If you are not sure whether the data are Weibull-distributed, I would recommend using ks.test, which tests whether your data come from a hypothesised distribution. Given your knowledge of the nature of the data, you could test a few candidate distributions and see which one works best.

For your example this would look like this:

ks <- ks.test(sample, "pweibull", shape = f$estimate[1], scale = f$estimate[2])
ks

The p-value is not significant, hence you do not reject the hypothesis that the data come from a Weibull distribution.

Update: The histograms of both the Weibull and the exponential look like a good match to your data. I think the exponential distribution gives you the better fit. The Pareto distribution is another option.

f <- fitdistr(sample, 'weibull')
z <- rweibull(10000, shape = f$estimate[1], scale = f$estimate[2])
hist(z)

f <- fitdistr(sample, 'exponential')
z <- rexp(10000, rate = f$estimate[1])
hist(z)
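To put the "better fit" claim on a slightly more formal footing (a quick sketch of my own), the two fitdistr fits can be compared by AIC; fitdistr objects carry a log-likelihood, so AIC() works on them directly:

```r
library(MASS)  # for fitdistr

sample <- c(7787, 3056, 2359, 1759, 1819, 1189, 1077, 1080, 985, 622, 648, 518,
            611, 1037, 727, 489, 432, 371, 1125, 69, 595, 624)

f.weib <- fitdistr(sample, "weibull")
f.exp  <- fitdistr(sample, "exponential")

# Lower AIC indicates the better-fitting model
AIC(f.weib, f.exp)
```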
TinaW
  • Hmm, I am afraid I made a mistake in accepting this answer as correct. The `fitdistr` function treats the values (here: values from the `sample` vector) as **realizations** from distribution `T` (in other words: points drawn from distribution `T`), not as **data points of a density function curve** for some distribution. See that when I use the estimated `shape` and `scale` parameters to draw points from the estimated `T` and *then* plot the density of those points (which is **not** the case of my question), I end up with a density like [this](http://imgur.com/h9Rxta9), where the x-axis values are not right. – Marta Karas May 03 '15 at 10:06
  • What do you mean by : "data points of a density function curve for some distribution"? In your question you said you think it is Weibull. The pdf is for a Weibull with the estimated parameters. If you want to compare it with your graph, you need to compare it with hist(sample). Your graph above does not look like a pdf. – TinaW May 03 '15 at 10:15
  • Hi @TinaW , I kindly refer you to the update I have just added to my question. – Marta Karas May 03 '15 at 10:34
  • What makes you think this is Weibull distributed? – TinaW May 03 '15 at 11:59
  • I think only the tail is. – Mike Wise May 03 '15 at 12:37
  • Can you post further details on: 1) the untransformed series (I understand that the "sample" you posted is the transformed data following an unknown distribution; can you post the untransformed one, which does follow a Weibull)? 2) details about the transformation you performed to get "sample", i.e. what "some normalisation" means. – TinaW May 03 '15 at 21:11
  • Hi @TinaW, this is the original data I obtained (data about the number of users who continued to use a service for a particular number of time units (`x`-axis)). By "some normalization" I meant only the idea that after normalizing the values (numbers of users) we would get some weights which may somehow correspond to the density function values. It's like plotting a histogram with the `hist` function in `R`, where you may choose between the `freq=TRUE` and `probability=TRUE` parameters. Here, on the plots in my original question, the data correspond to the `freq=TRUE` type and ... – Marta Karas May 04 '15 at 07:09
  • ... and "some normalizing" would make the values be of the `probability=TRUE` nature. However, I do not think this is a crucial issue here :) – Marta Karas May 04 '15 at 07:10