
I have a dataset with about 3,000 rows. The data can be accessed via https://pastebin.com/i4dYCUQX

Problem: the output contains NA, even though the data appear to contain no NA values. Here is what happens when I try to sum the total volume within each category of the size column via dplyr or aggregate:

example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
example

# dplyr
library(dplyr)
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))

Out:
# A tibble: 4 x 2
         size    volume
       <fctr>     <int>
1 Extra Large        NA
2       Large        NA
3      Medium 937581572
4       Small        NA

# aggregate
aggregate(volume ~ size, data=example, FUN=sum)

Out:
         size    volume
1 Extra Large        NA
2       Large        NA
3      Medium 937581572
4       Small        NA

When trying to access the value via colSums, it seems to work:

# Colsums
small <- example %>% filter(size == "Small")
colSums(small["volume"], na.rm = FALSE, dims = 1)

Out:
volume 
3869267348 

Can anyone imagine what the issue could be?

Christopher
    Well, I believe the _Warning messages_ are rather informative: `[...] integer overflow - use sum(as.numeric(.))` – Henrik Oct 14 '17 at 18:41

2 Answers


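It's because volume is stored as an integer rather than a numeric (double), so the group sums overflow the integer range. Convert the column to numeric before summing: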

example$volume <- as.numeric(example$volume)

aggregate(volume ~ size, data=example, FUN=sum)

         size      volume
1 Extra Large  3609485056
2       Large 11435467097
3      Medium   937581572
4       Small  3869267348
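
To see the overflow in isolation, here is a minimal sketch using two made-up integer values (assumptions for illustration, not rows from the dataset) whose total exceeds .Machine$integer.max. It also suggests why colSums appeared to work in the question: colSums returns a double, so its result is not limited to the integer range.

```r
# Largest value an R integer can hold (2147483647 on standard builds)
int_max <- .Machine$integer.max

# Made-up integer volumes whose total exceeds int_max
volume <- c(2000000000L, 1869267348L)

# sum() on an integer vector must return an integer, so the result
# overflows to NA; the warning is suppressed here only to keep output quiet
int_total <- suppressWarnings(sum(volume))

# Converting to double first avoids the overflow
dbl_total <- sum(as.numeric(volume))

# colSums() returns a double even for integer input
col_total <- colSums(data.frame(volume = volume))

print(int_total)
print(dbl_total)
print(col_total)
```

The conversion costs nothing in precision here, since a double represents whole numbers of this size exactly.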

For more, see:

What is integer overflow in R and how can it happen?

DataTx

The first thing to note is that, running your example, I get:

example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum))
#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))

#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))

#> Warning in summarise_impl(.data, dots): integer overflow - use
#> sum(as.numeric(.))
#> # A tibble: 4 × 2
#>          size    volume
#>        <fctr>     <int>
#> 1 Extra Large        NA
#> 2       Large        NA
#> 3      Medium 937581572
#> 4       Small        NA

which clearly states that your sums are overflowing the integer type. If we do as the warning message suggests, we can convert the integers to numerics and then sum:


example <- read.csv("https://pastebin.com/raw/i4dYCUQX", header=TRUE, sep=",")
# dplyr
example %>% group_by(size) %>% summarize_at(vars(volume), funs(sum(as.numeric(.))))
#> # A tibble: 4 × 2
#>          size      volume
#>        <fctr>       <dbl>
#> 1 Extra Large  3609485056
#> 2       Large 11435467097
#> 3      Medium   937581572
#> 4       Small  3869267348

Here funs(sum) has been replaced by funs(sum(as.numeric(.))), which still runs sum on each group but converts the values to numeric first.
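
As a side note, funs() was deprecated in dplyr 0.8.0, so on a recent dplyr the same idea can be written with a plain summarise(). The sketch below assumes a made-up data frame called toy (not the pastebin data) with the same size and volume column names:

```r
library(dplyr)

# Made-up data: the integer volumes for "Small" total more than
# .Machine$integer.max, so an integer sum would overflow to NA
toy <- data.frame(
  size   = c("Small", "Small", "Medium"),
  volume = c(2000000000L, 1869267348L, 5L)
)

# Convert to double inside the summary so each group sum cannot overflow
totals <- toy %>%
  group_by(size) %>%
  summarise(volume = sum(as.numeric(volume)))

print(totals)
```

The volume column of the result is a double, so totals above the integer limit are reported as numbers rather than NA.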

Eumenedies