
I have a dataset containing insurance pricing and coverage information. The first column is the policy identifier, and the remaining columns give the premium, limit, deductible, and further details encoded as dummy variables (covered perils and state).

Identifier  Price   Limit Deductible Peril1 Peril2 Peril3 Peril4 Peril5 Peril6 State1 State2 State3 State4
POL1        250.0  100000      500.0      1      1      1      0      0      1      1      0      0      0
POL1        625.0  100000     1000.0      1      1      1      0      0      1      1      0      0      0
POL1       1650.0  500000     1000.0      1      1      1      0      0      1      1      0      0      0
POL1       2500.0 1000000     1000.0      1      1      1      0      0      1      1      0      0      0
POL1       4375.0 2000000     2000.0      1      1      1      0      0      1      1      0      0      0
POL2         25.0   50000      500.0      0      0      1      1      0      0      1      0      0      0
POL3        60.25   25000      500.0      1      1      1      1      1      1      1      0      0      0
POL3        73.25   50000      500.0      1      1      1      1      1      1      1      0      0      0

Moreover, as can be seen from the sample data frame, several rows can refer to the same insurance product. In the original data frame, up to 40 rows may refer to a single policy, while other policies are described by a single row.

I am trying to fit a multiple linear regression:

reg <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1 + State2 + State3 + State4, data = df)

After fitting the regression, I found that the residuals do not follow a normal distribution, so I decided to take the log of the dependent variable. In addition, my data frame contains several outliers and shows heteroscedasticity.
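To make the two diagnostics mentioned above concrete, here is a minimal base-R sketch. It runs on simulated data, not on my real data: the data frame `sim`, its coefficients, and the noise pattern are assumptions made up for illustration. It checks residual normality with `shapiro.test` and heteroscedasticity with a Breusch-Pagan-type statistic (the studentized n * R^2 form) computed by hand.

```r
set.seed(1)

# Simulated stand-in for the real data (all numbers here are made up):
# the noise on the log scale grows with the limit.
n <- 200
sim <- data.frame(
  Limit      = runif(n, 25000, 2e6),
  Deductible = sample(c(500, 1000, 2000), n, replace = TRUE)
)
sim$Premium <- exp(3 + 2e-6 * sim$Limit + 2e-4 * sim$Deductible +
                   rnorm(n, sd = 0.2 + 5e-7 * sim$Limit))

fit <- lm(log(Premium) ~ Limit + Deductible, data = sim)

# Normality of the residuals (Shapiro-Wilk test)
sw <- shapiro.test(residuals(fit))

# Heteroscedasticity: regress the squared residuals on the predictors;
# n * R^2 is compared with a chi-square whose df is the number of predictors.
sim$e2  <- residuals(fit)^2
aux     <- lm(e2 ~ Limit + Deductible, data = sim)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

print(sw$p.value)
print(c(statistic = bp_stat, p.value = bp_p))
```

A small p-value from the second check would suggest that the error variance depends on the predictors, which is what I believe is happening in my data.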

For these reasons I thought weighted least squares (WLS) regression could solve my problem, because it lets me assign an appropriate weight to each observation. To understand how WLS works and the theory behind it, I tried a simple weighted regression as explained here:

wt <- 1 / lm(abs(reg$residuals) ~ reg$fitted.values)$fitted.values^2
wls_model <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1 + State2 + State3 + State4, data = df, weights = wt)

But looking at the results, I don't think this is the correct way to tackle my problem, especially since many rows end up not being considered.
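For reference, one variant of this weighting step that I could try is sketched below, again on simulated data (the data frame `sim` and all its numbers are assumptions). Instead of regressing the absolute residuals on the fitted values, it models the log of the squared residuals, so the implied variance is always positive. It also stores the weights as a column of the data frame so there is exactly one weight per row. I am not sure this is the right approach for my case.

```r
set.seed(42)

# Simulated stand-in for the real data (numbers are made up)
n <- 300
sim <- data.frame(Limit = runif(n, 25000, 2e6))
sim$Premium <- exp(3 + 2e-6 * sim$Limit +
                   rnorm(n, sd = 0.2 + 5e-7 * sim$Limit))

# Step 1: ordinary least squares. With na.action = na.exclude,
# residuals() and fitted() are padded with NA so they stay aligned
# with the rows of the data frame if some rows have missing values.
ols <- lm(log(Premium) ~ Limit, data = sim, na.action = na.exclude)

# Step 2: model the log of the squared residuals
sim$log_e2 <- log(residuals(ols)^2)
var_fit <- lm(log_e2 ~ Limit, data = sim, na.action = na.exclude)

# Step 3: weight = 1 / estimated variance, one weight per row
sim$wt <- 1 / exp(fitted(var_fit))

# Step 4: weighted fit
wls <- lm(log(Premium) ~ Limit, data = sim, weights = wt)

# Every row should still be in the weighted fit
print(c(rows = nrow(sim), used = nobs(wls)))
```

Comparing `nobs(wls)` with `nrow(sim)` is a quick way to see whether rows are being dropped from the weighted fit.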

From my understanding, since the `weights` argument of `lm` must be a vector, I could assign a specific weight to each policy. For instance, each row pertaining to POL1 would get a weight of 1/5. Despite having read the documentation and relevant posts, and having searched for packages that could make this easier, it is not clear to me how to implement WLS in my case.
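The per-policy weighting idea can be sketched as follows, using the sample rows above with only the first four columns and a reduced formula so that the tiny sample can be fitted. The column names `n_rows` and `wt` are names I made up for illustration, and the response is `Price` because that is what the column is called in the sample. Whether 1/n is a statistically sensible weight here is exactly what I am unsure about.

```r
# Sample rows from the question (dummy columns left out for brevity)
df <- read.table(header = TRUE, text = "
Identifier Price Limit Deductible
POL1 250.0 100000 500.0
POL1 625.0 100000 1000.0
POL1 1650.0 500000 1000.0
POL1 2500.0 1000000 1000.0
POL1 4375.0 2000000 2000.0
POL2 25.0 50000 500.0
POL3 60.25 25000 500.0
POL3 73.25 50000 500.0
")

# Number of rows belonging to each policy, repeated on every row
df$n_rows <- ave(rep(1, nrow(df)), df$Identifier, FUN = sum)

# Each row gets weight 1 / (rows in its policy), so every policy
# contributes a total weight of 1 regardless of how many rows it has
df$wt <- 1 / df$n_rows

# Weighted fit; 'weights' is looked up as a column of 'data'
fit_pol <- lm(log(Price) ~ Limit + Deductible, data = df, weights = wt)

# Check: the weights within each policy sum to 1
print(tapply(df$wt, df$Identifier, sum))
```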

Rui Barradas
meg
    [Please do not upload images of code/data/errors.](//meta.stackoverflow.com/q/285551) Please provide example data easily usable by others. – Anonymous coward Dec 30 '22 at 17:37
    Why not use a robust regression model that calculates the weights for you, like `wtmod <- MASS::rlm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1+ State2 + State3 + State4, data=df, method="MM")`, then you can see the weights with `wtmod$w`. – DaveArmstrong Dec 30 '22 at 17:43
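A minimal sketch of the robust-regression suggestion in the comment above, on simulated data with a few injected outliers. The data frame `sim`, the rescaled predictor `Limit_m` (limit in millions), and all numbers are assumptions for illustration. `MASS::rlm` with `method = "MM"` and the `w` component of the fitted object are used as in the comment.

```r
library(MASS)

set.seed(7)

# Simulated stand-in for the real data (numbers are made up)
n <- 200
sim <- data.frame(Limit_m = runif(n, 0.025, 2))  # limit in millions
sim$Premium <- exp(3 + 2 * sim$Limit_m + rnorm(n, sd = 0.3))

# Inflate the first five premiums to act as outliers
sim$Premium[1:5] <- sim$Premium[1:5] * 50

# MM-estimation; the final robustness weights are stored in wtmod$w
wtmod <- rlm(log(Premium) ~ Limit_m, data = sim, method = "MM")

# Rows with the smallest weights are the ones the fit treats as outlying
print(head(order(wtmod$w), 5))
print(summary(wtmod$w))
```

Rows with weights near zero are effectively excluded from the fit, so inspecting `wtmod$w` shows which observations are being downweighted.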
