I am using a nested data approach to apply a censored data model to stream clarity data from 1275 streams (~225,000 observations). I have successfully used group_by to group the data set at three hierarchical levels (HUC4, major watershed, and stream; think country, state, county). I want to pursue this approach because it appears to be much faster and easier to read than the for-loop approach I have been using. However, when I map the model over the nested data frame I get the error "NA/NaN/Inf in foreign function call". This is puzzling, since the approach works fine on the large and middle-sized group_by data frames, and the underlying observations in the three group_by data frames are identical (just grouped at different levels). The data are large and unwieldy, but I can try to give some clues as to the structure.

The starting data look like this:

> summary(tb_cens)
     huc4           loc_major_basin    sys_loc_code        sample_date               y              m         
 Length:203631      Min.   : 4010101   Length:203631      Min.   :1998-04-06   Min.   :1998   Min.   : 1.000  
 Class :character   1st Qu.: 7010207   Class :character   1st Qu.:2006-05-27   1st Qu.:2006   1st Qu.: 5.000  
 Mode  :character   Median : 7020011   Mode  :character   Median :2009-09-10   Median :2009   Median : 7.000  
                    Mean   : 7193116                      Mean   :2009-10-29   Mean   :2009   Mean   : 6.676  
                    3rd Qu.: 7040004                      3rd Qu.:2013-08-28   3rd Qu.:2013   3rd Qu.: 8.000  
                    Max.   :10230003                      Max.   :2018-10-23   Max.   :2018   Max.   :12.000  
       d              doy        combined_stube_conv100_conv60 detection_limit record_length   censored1      
 Min.   : 1.00   Min.   :  1.0   Min.   :  0.00                TRUE : 80189    Min.   :10.00   Mode :logical  
 1st Qu.: 9.00   1st Qu.:143.0   1st Qu.: 26.00                FALSE:123442    1st Qu.:12.00   FALSE:159845   
 Median :16.00   Median :184.0   Median : 58.57                                Median :14.00   TRUE :43786    
 Mean   :16.02   Mean   :187.8   Mean   : 53.29                                Mean   :14.48                  
 3rd Qu.:24.00   3rd Qu.:233.0   3rd Qu.: 72.00                                3rd Qu.:17.00                  
 Max.   :31.00   Max.   :365.0   Max.   :100.00                                Max.   :26.00                  
 censored2      
 Mode :logical  
 FALSE:167033   
 TRUE :36598 

In my case the commands are

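##### load packages (survival for Surv/survreg; tidyverse for group_by, nest, map)
library(survival)
library(tidyverse)

##### create the model function
cens_model <- function(tb_cens) {
  survreg(Surv(left_clarity, right_clarity, type = 'interval2') ~ y + m,
          data = tb_cens, dist = 'gaussian')
}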

##### group_by huc4 (12 huc4s)
by_huc4 <- tb_cens %>%
  group_by(huc4) %>%
  nest()

# apply censored data model to each huc4 and mutate results to data frame
by_huc4 <- by_huc4 %>%
  mutate(huc_model = map(data, cens_model))
by_huc4

Which works perfectly! Also,

##### group_by watershed (75 major watersheds)
by_watershed <- tb_cens %>%
  group_by(loc_major_basin) %>%
  nest()

# apply censored data model to each watershed and mutate results to data frame
by_watershed <- by_watershed %>%
  mutate(watershed_model = map(data, cens_model))
by_watershed

Which also works perfectly! However, trying the same technique on streams (smallest group_by level) throws an error about NA/NaN/Inf in foreign function call.

##### group_by stream
by_stream <- tb_cens %>%
  group_by(sys_loc_code) %>%
  nest()

# apply censored data model to each stream and mutate results to data frame
by_stream <- by_stream %>%
  mutate(stream_model = map(data, cens_model))
by_stream

This gives the following error:

Error in mutate_impl(.data, dots) :
Evaluation error: NA/NaN/Inf in foreign function call (arg 3).

There are no NAs or NaNs in my data. There are some Inf values in the final column, but the Tobit model requires those because they mark right-censored observations. The map call also worked perfectly at the largest and middle group_by levels; it only fails when I group at the stream level.
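Since the same rows fit fine in larger groups, one thing that may be worth checking is what each individual stream looks like once it is isolated. The sketch below summarises each group by row count, exact versus open-ended (Inf) observations, and distinct values of the predictors. The `toy` data frame and the `stream_check` name are hypothetical stand-ins (the real data are not shown), and the idea that a stream with only censored values or a single year could upset the fit is a guess, not a confirmed cause.

```r
library(dplyr)

# Hypothetical toy data standing in for tb_cens; the column names follow
# the model call, the values are invented for illustration.
toy <- tibble(
  sys_loc_code  = c("A", "A", "A", "B", "B", "B"),
  y             = c(2001, 2002, 2003, 2005, 2005, 2005),
  m             = c(5, 6, 7, 8, 8, 8),
  left_clarity  = c(40, 55, 100, 100, 100, 100),
  right_clarity = c(40, 55, Inf, Inf, Inf, Inf)
)

# One row per stream describing how much information the group carries.
stream_check <- toy %>%
  group_by(sys_loc_code) %>%
  summarise(
    n_obs    = n(),
    # exact observations: both bounds finite and equal
    n_exact  = sum(is.finite(right_clarity) & left_clarity == right_clarity),
    # open-ended (censored) observations: either bound is not finite
    n_open   = sum(!is.finite(left_clarity) | !is.finite(right_clarity)),
    n_years  = n_distinct(y),
    n_months = n_distinct(m)
  )

stream_check
```

On the real data, sorting this table by `n_exact`, `n_years`, or `n_months` would show whether any stream is fully censored or has a constant predictor.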

Does anyone have ideas on how to track this down? Any thoughts would be much appreciated.
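One possible way to find out which streams trigger the error, rather than having the whole mutate call stop, is to wrap the model function in purrr::safely(), which returns a list with `result` and `error` elements for every group. This is only a sketch: the `toy` data, `safe_cens_model`, `fits`, `failed`, and `msg` are hypothetical names introduced here, and the simulated values are not the real clarity data.

```r
library(survival)
library(dplyr)
library(tidyr)
library(purrr)

# Same model as in the question, restated so the sketch is self-contained.
cens_model <- function(df) {
  survreg(Surv(left_clarity, right_clarity, type = "interval2") ~ y + m,
          data = df, dist = "gaussian")
}

# safely() captures an error instead of stopping the whole map.
safe_cens_model <- safely(cens_model)

# Hypothetical toy data: two streams, readings capped at 100, with capped
# readings treated as right-censored (upper bound Inf).
set.seed(42)
toy <- tibble(
  sys_loc_code = rep(c("A", "B"), each = 30),
  y            = rep(1999:2013, times = 4),
  m            = sample(4:10, 60, replace = TRUE),
  left_clarity = pmin(round(rnorm(60, mean = 70, sd = 20)), 100)
) %>%
  mutate(right_clarity = ifelse(left_clarity >= 100, Inf, left_clarity))

fits <- toy %>%
  group_by(sys_loc_code) %>%
  nest() %>%
  ungroup() %>%
  mutate(
    fit    = map(data, safe_cens_model),
    # TRUE where the fit raised an error
    failed = map_lgl(fit, ~ !is.null(.x$error)),
    # error message for failed groups, NA otherwise
    msg    = map_chr(fit, ~ if (is.null(.x$error)) NA_character_
                            else conditionMessage(.x$error))
  )

# Streams whose fit failed, with the captured message
fits %>% filter(failed) %>% select(sys_loc_code, msg)
```

Applied to by_stream, the rows with `failed == TRUE` would identify the streams to inspect, and one of their `data` elements could serve as a small reproducible example.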

Joe
kray
  • Does each stream have enough rows to run the model? – Jon Spring Dec 03 '18 at 21:32
  • Could you make your problem reproducible by sharing a sample of your data so others can help (please do not use `str()`, `head()` or screenshot)? You can use the [`reprex`](https://reprex.tidyverse.org/articles/articles/magic-reprex.html) and [`datapasta`](https://cran.r-project.org/web/packages/datapasta/vignettes/how-to-datapasta.html) packages to assist you with that. See also [Help me Help you](https://speakerdeck.com/jennybc/reprex-help-me-help-you?slide=5) & [How to make a great R reproducible example?](https://stackoverflow.com/q/5963269) – Tung Dec 04 '18 at 02:16
  • I've tried filtering for n >= 100 records per stream (which should be plenty of records for the analysis), and I'm getting the same error. I will try to subset a sample of the data that still produces the error and post it. I was just concerned about the size of such a data set. – kray Dec 04 '18 at 14:29
