I have been generating some features for clustering and needed the correlation coefficient based off of customer claims submitted over time. I used this code to get the coefficient by running a lm
model over nested tibbles of data:
provProfileTemp <- byProvProfile %>%
mutate(date = ymd(paste(Year, Month, "01", sep = "-"))) %>%
select(-Month, -Year) %>%
group_by(AccountNumber, date) %>%
count() %>%
group_by(AccountNumber) %>%
mutate(total_claims = sum(n)) %>%
ungroup() %>%
mutate(numeric_date = as.numeric(date)/(24*60*60)) %>% # POSIX conversion for summary(lm)
select(AccountNumber, numeric_date, claims = n, total_claims) %>%
nest(-AccountNumber, -total_claims)
coeffs <- provProfileTemp %>%
mutate(
fit = map(provProfileTemp$data, ~lm(numeric_date ~ claims, data = .)),
results = map(fit, summary, correlation = TRUE),
coeff = results %>% map(c("correlation")) %>% map(3)
) %>%
select(AccountNumber, coeff, total_claims)
The top block creates the variables needed for the regression line and nests the data into a tibble with the account number, total claims, and a tibble of the data for the regression. Using purrr::map
in the second block, I'm able to fit a line, get the results from the summary, and pull the coeff from the summary.
The results are correct and work fine, however, the new column is a list with the single value of the coefficient in it. I cannot get compress the list to use the new column as just the coefficient and not a list. Using unlist()
gives this error: Error in mutate_impl(.data, dots) : Column
coeffmust be length 27768 (the number of rows) or one, not 21949
. This is happening because unlist()
is not returning the same number of elements. I have had similar results with functions like purrr::flatten
or unlist(lapply(coeff, "[[", 1))
.
Any suggestions on how I can flatten the list properly into a single value or approach the problem in a different way which doesn't require generating the coefficient like this? Any help is greatly appreciated. Thank you.
This is what the data looks like:
AccountNumber coeff total_claims
<int> <list> <int>
16 <dbl [1]> 494
19 <dbl [1]> 184
45 <dbl [1]> 81...
Here is dummy data:
provProfileTemp <- structure(list(AccountNumber = c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L
), Year = c(2018L, 2017L, 2018L, 2018L, 2018L, 2017L, 2018L,
2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L,
2018L, 2018L, 2018L, 2018L), Month = c(4L, 11L, 1L, 1L, 3L, 10L,
1L, 3L, 7L, 1L, 5L, 10L, 5L, 2L, 4L, 4L, 4L, 3L, 2L, 1L)), .Names = c("AccountNumber",
"Year", "Month"), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))