I need to fit lognormal, Pareto, and generalized Pareto distributions to some empirical data that is a combination of censored and uncensored data. I tried using the function fitdistcens from the fitdistrplus package.
I generated some artificial data consisting of both censored data and uncensored data (my R code is below).
The censored data can be described as follows: There are 5000 values between 100 and 200, there are 700 values between 200 and 500, there are 600 values between 500 and 1000, and so on. We have no further information about these values. The complete list of all bins for the censored data is:
[100, 200]
[200, 500]
[500, 1000]
[1000, 2000]
[2000, 5000]
[5000, 10000]
[10000, 20000]
[20000, 100000]
The uncensored data was created by generating 70 normally distributed random variables with mean 0 and standard deviation 1, then squaring the variables, then multiplying them by 50000, and finally adding 20000. So these are the values that we know exactly.
I then fit lognormal and Pareto distributions to the combined censored and uncensored data, using the "fitdistcens" function in R from the package "fitdistrplus". Then I created QQ plots, PP plots, and plotted comparisons of empirical and theoretical CDFs to assess the goodness of fit of the distributions.
When I look at the plots mentioned above, I see that in the PP plot and the QQ plot, there is a rectangle to represent all of the bins above, EXCEPT for the bin [20000, 100000]. I wonder if somebody could shed some light on how these graphical representations of the empirical distribution were constructed.
[QQ plot][1]
library(fitdistrplus)
# Creating artificial censored data
left <- c(100,200,500,1000,2000,5000,10000,20000)
right <- c(200,500,1000,2000,5000,10000,20000,100000)
freqs <- c(5000,700,600,300,150,100,50,25)
df <- data.frame(left,right)
df_censored <- df[rep(seq_len(nrow(df)),times=freqs),]
# Create artificial uncensored data
left <- 20000 + 50000*rnorm(70)^2
right <- left
df_uncensored <- data.frame(left,right)
df_cens_and_uncens <- rbind(df_censored,df_uncensored)
dist_fit_lnorm <- fitdistcens(df_cens_and_uncens, "lnorm")
distr_to_plot <-list(lnorm = dist_fit_lnorm)
# plot function to compare empirical and fitted cdfs
cdfcompcens(distr_to_plot, xlim = c(0, 35000), plotstyle = "ggplot")
# pp plot for the different distribution to check goodness of fit
ppcompcens(distr_to_plot, plotstyle = "ggplot")
# qq plot for the different distribution to check goodness of fit
qqcompcens(distr_to_plot, plotstyle = "ggplot", xlim = c(0, 100000))