I have a dataset containing insurance pricing and coverage information. The first column is the policy identifier; the remaining columns give the premium (shown as Price in the sample below), the limit, and the deductible, followed by dummy variables for the covered perils and the state.
Identifier | Price | Limit | Deductible | Peril1 | Peril2 | Peril3 | Peril4 | Peril5 | Peril6 | State1 | State2 | State3 | State4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
POL1 | 250.0 | 100000 | 500.0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
POL1 | 625.0 | 100000 | 1000.0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
POL1 | 1650.0 | 500000 | 1000.0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
POL1 | 2500.0 | 1000000 | 1000.0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
POL1 | 4375.0 | 2000000 | 2000.0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
POL2 | 25.0 | 50000 | 500.0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
POL3 | 60.25 | 25000 | 500.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
POL3 | 73.25 | 50000 | 500.0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
As the sample data frame shows, several rows can refer to the same insurance product. In the original data frame, up to 40 rows may refer to a single policy, while other policies are described by a single row.
I am trying to fit a multiple linear regression:
reg <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1 + State2 + State3 + State4, data = df)
The residuals of the regression did not follow a normal distribution, so I applied log() to the dependent variable. In addition, the data frame contains several outliers and shows signs of heteroscedasticity.
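A minimal sketch of the kind of checks described above, run on simulated stand-in data rather than on the real policies. The variables `x` and `y` and the noise structure are assumptions made purely for illustration; the auxiliary regression of absolute residuals on fitted values is the same device used for the weights further down.

```r
set.seed(1)

# Simulated stand-in data (NOT the real policies): noise grows with x.
x <- runif(200, min = 1, max = 10)
y <- exp(0.3 * x + rnorm(200, sd = 0.1 * x))

fit <- lm(log(y) ~ x)

# Normality check on the residuals (Shapiro-Wilk test from base R's stats).
sw <- shapiro.test(residuals(fit))

# Rough heteroscedasticity check: regress |residuals| on fitted values.
# A clearly positive slope suggests the spread grows with the fitted value.
aux <- lm(abs(residuals(fit)) ~ fitted(fit))
slope <- unname(coef(aux)[2])

# Visual counterpart of the same check.
plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

On the real data, `fit` would be replaced by the model fitted to `df`.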
For these reasons, I thought WLS regression could be a solution, because it lets me assign an appropriate weight to each observation's error term. To understand how WLS works and the theory behind it, I tried a simple weighted regression as explained here:
wt <- 1 / lm(abs(reg$residuals) ~ reg$fitted.values)$fitted.values^2
wls_model <- lm(log(Premium) ~ Limit + Deductible + Peril1 + Peril2 + Peril3 + Peril4 + Peril5 + Peril6 + State1 + State2 + State3 + State4, data = df, weights = wt)
Looking at the results, however, I don’t think this is the right approach for my problem, particularly because many rows end up not being considered in the fit.
From my understanding, since the weights argument of lm should be a vector, I could assign a specific weight to each policy. For instance, each row pertaining to POL1 would get a weight of 1/5. Despite having read the documentation and relevant posts, and having searched for packages that could facilitate my work, it is still not clear to me how to implement WLS in my case.
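To make the per-policy idea concrete, here is a minimal sketch, assuming the weight of each row is simply 1 divided by the number of rows sharing its Identifier. It uses the sample rows from the table (hence the column name `Price` rather than `Premium`) and a reduced formula, because the dummy columns barely vary in so few rows. The names `n_per_policy`, `w`, and `wls_by_policy` are hypothetical.

```r
# Sample rows copied from the table above (only the columns used here).
df <- data.frame(
  Identifier = c("POL1", "POL1", "POL1", "POL1", "POL1", "POL2", "POL3", "POL3"),
  Price      = c(250, 625, 1650, 2500, 4375, 25, 60.25, 73.25),
  Limit      = c(100000, 100000, 500000, 1000000, 2000000, 50000, 25000, 50000),
  Deductible = c(500, 1000, 1000, 1000, 2000, 500, 500, 500)
)

# For every row, the number of rows belonging to the same policy
# (5 for POL1, 1 for POL2, 2 for POL3).
n_per_policy <- ave(rep(1, nrow(df)), df$Identifier, FUN = sum)

# Weight 1/n per row, so that each policy carries a total weight of 1.
df$w <- 1 / n_per_policy

# Reduced formula for illustration only; the full model would add the dummies.
wls_by_policy <- lm(log(Price) ~ Limit + Deductible, data = df, weights = w)
summary(wls_by_policy)
```

Weights built this way only balance the influence of policies with many rows against policies with few; they are not an estimate of the error variance, so whether they help with the heteroscedasticity is a separate question.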