
I have been generating some features for clustering and needed the correlation coefficient based off of customer claims submitted over time. I used this code to get the coefficient by running a lm model over nested tibbles of data:

provProfileTemp <- byProvProfile %>% 
  mutate(date = ymd(paste(Year, Month, "01", sep = "-"))) %>% 
  select(-Month, -Year) %>% 
  group_by(AccountNumber, date) %>% 
  count() %>% 
  group_by(AccountNumber) %>% 
  mutate(total_claims = sum(n)) %>% 
  ungroup() %>% 
  mutate(numeric_date = as.numeric(date)/(24*60*60)) %>% # POSIX conversion for summary(lm)
  select(AccountNumber, numeric_date, claims = n, total_claims) %>% 
  nest(-AccountNumber, -total_claims)

coeffs <- provProfileTemp %>% 
  mutate(
    fit = map(provProfileTemp$data, ~lm(numeric_date ~ claims, data = .)), 
    results = map(fit, summary, correlation = TRUE), 
    coeff = results %>% map(c("correlation")) %>% map(3)
  ) %>% 
  select(AccountNumber, coeff, total_claims)

The top block creates the variables needed for the regression line and nests the data into a tibble with the account number, total claims, and a tibble of the data for the regression. Using purrr::map in the second block, I'm able to fit a line, get the results from the summary, and pull the coeff from the summary.

The results are correct and work fine; however, the new column is a list with the single value of the coefficient in each element. I cannot compress the list so that the new column holds just the coefficient rather than a list. Using unlist() gives this error: Error in mutate_impl(.data, dots) : Column `coeff` must be length 27768 (the number of rows) or one, not 21949. This happens because unlist() does not return the same number of elements as there are rows. I have had similar results with functions like purrr::flatten or unlist(lapply(coeff, "[[", 1)).

Any suggestions on how I can flatten the list properly into a single value or approach the problem in a different way which doesn't require generating the coefficient like this? Any help is greatly appreciated. Thank you.

This is what the data looks like:

AccountNumber       coeff  total_claims
        <int>      <list>         <int>
           16   <dbl [1]>           494     
           19   <dbl [1]>           184     
           45   <dbl [1]>            81...

Here is dummy data:

byProvProfile <- structure(list(AccountNumber = c(1L, 1L, 1L, 1L, 
     1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
     2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L
     ), Year = c(2018L, 2017L, 2018L, 2018L, 2018L, 2017L, 2018L, 
     2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
     2018L, 2018L, 2018L, 2018L), Month = c(4L, 11L, 1L, 1L, 3L, 10L, 
     1L, 3L, 7L, 1L, 5L, 10L, 5L, 2L, 4L, 4L, 4L, 3L, 2L, 1L)), .Names = c("AccountNumber", 
     "Year", "Month"), row.names = c(NA, -20L), class = c("tbl_df", 
     "tbl", "data.frame"))
john_mwood
  • I think you may want `map_dbl(3)` instead of `map(3)`. If you put in a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) I would be able to verify. – aosmith Nov 07 '18 at 15:43
  • I'm working on making a dummy data frame, however, I can't seem to force R to make the column a list. I was using `map_dbl` before but it doesn't work the way I expect it to. I get `Error in mutate_impl(.data, dots) : Evaluation error: Result 20 is not a length 1 atomic vector` – john_mwood Nov 07 '18 at 15:52
  • Definitely need an example dataset. Can you `dput()` part of `provProfileTemp` and add it to the question? When I use `mtcars` for a regression by groups `map_dbl()` seems to work fine: `mtcars %>% group_by(cyl) %>% nest() %>% mutate(fit = map(data, ~ lm(mpg ~ wt, data = .x)), results = map(fit, summary, correlation = TRUE), coef = results %>% map(c("correlation")) %>% map_dbl(3))` – aosmith Nov 07 '18 at 15:59
  • Have you tried `unnest(coeff)` at the end? – aosmith Nov 07 '18 at 16:03
  • According to the errors I got, `unnest` on coeff doesn't work on lists. – john_mwood Nov 07 '18 at 16:06
  • Thank you. `dput()` is super helpful @aosmith I didn't know that was a thing. – john_mwood Nov 07 '18 at 16:08
  • Updated the OP with dummy data – john_mwood Nov 07 '18 at 16:15
  • `map_dbl` works on this dummy data for some reason. It is possible I get an error because some data is missing and the `lm` doesn't produce anything in the first place? – john_mwood Nov 07 '18 at 16:20

1 Answer

Your comment about having some data missing and lm() not producing anything is the key here.

First, let's create a scenario with only a single value of the explanatory variable for one group. This reproduces the errors from map_dbl(), unnest(), etc.

library(purrr)
library(tidyr)
library(dplyr)

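# Copy wt, then set it to missing for all 4-cylinder cars
mtcars$wt2 = mtcars$wt
mtcars$wt2[mtcars$cyl == 4] = NA
# Row 3 is a 4-cylinder car, so that group keeps exactly one non-missing value
mtcars$wt2[3] = 1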

mtcars %>% 
    group_by(cyl) %>% 
    nest() %>% 
    mutate(fit = map(data, ~ lm(mpg ~ wt2, data = .x)), 
           results = map(fit, summary, correlation = TRUE), 
           coef = results %>% map(c("correlation")) %>% map_dbl(3))

Error in mutate_impl(.data, dots) : Evaluation error: Result 2 is not a length 1 atomic vector.

This is because one of the results is NULL.

mtcars %>% 
    group_by(cyl) %>% 
    nest() %>% 
    mutate(fit = map(data, ~ lm(mpg ~ wt2, data = .x)), 
           results = map(fit, summary, correlation = TRUE), 
           coef = results %>% map(c("correlation")) %>% map(3)) %>%
    pull(coef)

[[1]]
[1] -0.9944458

[[2]]
NULL

[[3]]
[1] -0.983668
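
The same NULL also explains the length mismatch you saw with unlist(). As a minimal sketch (the list below is a hypothetical stand-in for the `coef` column, with rounded values), unlist() silently drops NULL elements, so the result is shorter than the number of rows:

```r
# Hypothetical stand-in for a list column where one group produced nothing
coef_list <- list(-0.994, NULL, -0.984)

n_rows <- length(coef_list)          # one element per group
n_flat <- length(unlist(coef_list))  # unlist() drops the NULL

# Positions of the groups that returned NULL
null_groups <- which(vapply(coef_list, is.null, logical(1)))

print(c(rows = n_rows, flattened = n_flat))
print(null_groups)
```

Here `n_flat` is one less than `n_rows`, which is the kind of mismatch mutate() complains about.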

So you need to replace the NULL with something (or remove the rows without enough data prior to doing the model fitting, which could be the easiest solution). I often use possibly() in situations like this, although it was harder for your scenario. I ended up following this answer, but I'm sure there are other ways/tools to do this.
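
For the "remove the rows first" option, one possible sketch is below. It assumes that "enough data" means at least two distinct non-missing values of the explanatory variable per group; that threshold and the `cars` copy are my choices for illustration, so adjust them to your data.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Same setup as above, on a copy: the cyl == 4 group has one non-missing wt2
cars <- mtcars
cars$wt2 <- cars$wt
cars$wt2[cars$cyl == 4] <- NA
cars$wt2[3] <- 1

coefs <- cars %>%
    group_by(cyl) %>%
    # Assumption: a group needs at least two distinct non-missing wt2 values
    filter(n_distinct(wt2, na.rm = TRUE) >= 2) %>%
    nest() %>%
    mutate(fit = map(data, ~ lm(mpg ~ wt2, data = .x)),
           results = map(fit, summary, correlation = TRUE),
           # [3] is the off-diagonal entry of the 2 x 2 matrix, as in the code above
           coef = map_dbl(results, ~ .x$correlation[3]))

print(coefs)
```

The 4-cylinder group is dropped before any model is fitted, so every remaining group yields a single number and map_dbl() can be used directly.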

I return NA_real_ whenever there is no 3rd value in the correlation matrix.

mtcars %>% 
    group_by(cyl) %>% 
    nest() %>% 
    mutate(fit = map(data, ~ lm(mpg ~ wt2, data = .x)), 
           results = map(fit, summary, correlation = TRUE), 
           coef = results %>% map(c("correlation")) %>% 
               map_dbl(., possibly(~.x[3], NA_real_)))

# A tibble: 3 x 5
    cyl data               fit      results             coef
  <dbl> <list>             <list>   <list>             <dbl>
1     6 <tibble [7 x 11]>  <S3: lm> <S3: summary.lm>  -0.994
2     4 <tibble [11 x 11]> <S3: lm> <S3: summary.lm>  NA    
3     8 <tibble [14 x 11]> <S3: lm> <S3: summary.lm>  -0.984
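
Another option that might be simpler is the `.default` argument that purrr's positional extraction accepts. The sketch below uses two hypothetical matrices as stand-ins for the per-group correlation matrices, one full and one degenerate, rather than real model output:

```r
library(purrr)

# Hypothetical stand-ins for the per-group correlation matrices
cors <- list(
    matrix(c(1, -0.9, -0.9, 1), nrow = 2),  # full 2 x 2 matrix
    matrix(NaN, nrow = 1)                   # degenerate 1 x 1 matrix
)

# Extract the 3rd element; fall back to NA when it does not exist
coef <- map_dbl(cors, 3, .default = NA_real_)
print(coef)
```

I haven't run this on your data, so treat it as a starting point rather than a drop-in replacement.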
aosmith
  • Wow. Great, thorough explanation. Thank you @aosmith. It also works on my data really well. I appreciate all of the help. – john_mwood Nov 07 '18 at 17:05