0

This question is similar to this one, but in a summarise so the posted answer doesn't quite fit. The data are such that a row identifies a unit-time pair:

large_sql_df
id   t   var1   var2
1    1    10     0
1    2    20     1
2    1    11     0

And I would like to aggregate by var2 and time t:

localdf <- large_sql_df %>%
     group_by(var, t) %>%
     summarise(count = n(), var1_mean = mean(var1))

This gives the error: "Arithmetic overflow error converting expression to data type int." I think this is because count becomes a very large number. Is there a way to stop this from happening without having to do the entire query in SQL?

Liam
  • 159
  • 11

2 Answers2

2

The arithmetic overflow limit is 2,147,483,647 (see here) so this is probably not caused by count. It is much more likely caused by mean(var1) because to calculate the mean the computer first has to calculate sum(var1). If your var1 column often contains values greater than 40 then an overflow error is likely (40 * 350 million > 2,147,483,647).

The solution from here is relevant for your question too: Explicitly transform var1 into a more suitable format. Something like the following:

localdf <- large_sql_df %>%
     mutate(var1 = CAST(var1 AS FLOAT)) %>% 
     group_by(var, t) %>%
     summarise(count = n(), var1_mean = mean(var1))

Depending on your version of dbplyr you might need something like mutate(var1 = sql("CAST(var1 AS FLOAT)")).

See this question for a discussion of different data types. While float is an inexact data type it can handle numbers up to 10E38. If you are taking the average over such a large total then the impression introduced is unlikely to be significant.

Simon.S.A.
  • 6,240
  • 7
  • 22
  • 41
2

This just elaborates some elements of the answer from @simon-s-a:

library(dplyr, warn.conflicts = FALSE)

df <- read.table(header = TRUE, text = "
id   t   var1   var2
1    1    10     0
1    2    20     1
2    1    11     0")

large_sql_df <- dbplyr::tbl_memdb(df)

localdf <- 
  large_sql_df %>%
  mutate(var1 = as.double(var1)) %>% 
  group_by(var2, t) %>%
  summarise(count = n(), var1_mean = mean(var1, na.rm = TRUE),
            .groups = "drop")

localdf
#> # Source:   lazy query [?? x 4]
#> # Database: sqlite 3.35.5 [:memory:]
#>    var2     t count var1_mean
#>   <int> <int> <int>     <dbl>
#> 1     0     1     2      10.5
#> 2     1     2     1      20

localdf %>% show_query()
#> <SQL>
#> SELECT `var2`, `t`, COUNT(*) AS `count`, AVG(`var1`) AS `var1_mean`
#> FROM (SELECT `id`, `t`, CAST(`var1` AS REAL) AS `var1`, `var2`
#> FROM `df`)
#> GROUP BY `var2`, `t`

Created on 2021-07-07 by the reprex package (v2.0.0)

Ian Gow
  • 3,098
  • 1
  • 25
  • 31