1

I need to fit lognormal, Pareto, and generalized Pareto distributions to some empirical data that is a combination of censored and uncensored data. I tried using the function fitdistcens from the fitdistrplus package.

I generated some artificial data consisting of both censored data and uncensored data (my R code is below).

The censored data can be described as follows: There are 5000 values between 100 and 200, there are 700 values between 200 and 500, there are 600 values between 500 and 1000, and so on. We have no further information about these values. The complete list of all bins for the censored data is:

[100, 200]

[200, 500]

[500, 1000]

[1000, 2000]

[2000, 5000]

[5000, 10000]

[10000, 20000]

[20000, 100000]

The uncensored data was created by generating 70 normally distributed random variables with mean 0 and standard deviation 1, then squaring the variables, then multiplying them by 50000, and finally adding 20000. So these are the values that we know exactly.

I then fit lognormal and Pareto distributions to the combined censored and uncensored data, using the "fitdistcens" function in R from the package "fitdistrplus". Then I created QQ plots, PP plots, and plotted comparisons of empirical and theoretical CDFs to assess the goodness of fit of the distributions.

When I look at the plots mentioned above, I see that in the PP plot and the QQ plot, there is a rectangle to represent all of the bins above, EXCEPT for the bin [20000, 100000]. I wonder if somebody could shed some light on how these graphical representations of the empirical distribution were constructed.

[QQ plot][1]

library(fitdistrplus)

# Creating artificial censored data

left <- c(100,200,500,1000,2000,5000,10000,20000)
right <- c(200,500,1000,2000,5000,10000,20000,100000)
freqs <- c(5000,700,600,300,150,100,50,25)
df <- data.frame(left,right)

df_censored <- df[rep(seq_len(nrow(df)),times=freqs),]

# Create artificial uncensored data

left <- 20000 + 50000*rnorm(70)^2
right <- left

df_uncensored <- data.frame(left,right)

df_cens_and_uncens <- rbind(df_censored,df_uncensored)

dist_fit_lnorm <- fitdistcens(df_cens_and_uncens, "lnorm")

distr_to_plot <-list(lnorm = dist_fit_lnorm)

# plot function to compare empirical and fitted cdfs
cdfcompcens(distr_to_plot, xlim = c(0, 35000), plotstyle = "ggplot")

# pp plot for the different distribution to check goodness of fit
ppcompcens(distr_to_plot, plotstyle = "ggplot")

# qq plot for the different distribution to check goodness of fit
qqcompcens(distr_to_plot, plotstyle = "ggplot", xlim = c(0, 100000))
Chris J
  • 23
  • 3
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. This doesn't seem like a specific programming question. If you want statistical recommendations for fitting your data, you can ask for help at [stats.se]. Maybe clarify exactly what is "terrible" about the results using uncensored data. – MrFlick Jun 25 '21 at 17:17
  • Please see my edits, thanks. – Chris J Jul 14 '21 at 18:58

0 Answers0