I have been researching this for a while and can't find the issue. I use dplyr regularly, but all of a sudden I am getting odd output from the group_by/summarise combination.

I have a large dataset and I am trying to summarize it using the following:

dataAgg <- dataRed %>% group_by(ClmNbr, SnapshotDay, Pre2016) %>%
  filter(SnapshotDay == '30'| SnapshotDay == '90') %>%
  summarise(
    NumFeat = sum(FeatureNbr),
    TotInc = sum(IncSnapshotDay),
    TotDelta = sum(InctoFinal),
    TotPaid = sum(FinalPaid)
  )

The setup of the data frame is below:

'data.frame':   123819 obs. of  8 variables:
 $ ClmNbr        : Factor w/ 33617 levels "14-00765132",..: 2162 2163 2163 2164 1842 2287 27 27 27 28 ...
 $ SnapshotDay   : Factor w/ 3 levels "7","30","90": 1 1 1 1 1 1 1 1 1 1 ...
 $ Pre2016       : Factor w/ 2 levels "Post2016","Pre2016": 2 2 2 2 2 2 2 2 2 2 ...
 $ FeatureNbr    : int  6 2 3 3 6 2 4 5 6 5 ...
 $ IncSnapshotDay: num  5000 77 5000 4500 77 2200 1800 1100 1800 25000 ...
 $ FinalPaid     : num  442 0 15000 5000 0 ...
 $ InctoFinal    : num  -4558 -77 10000 500 -77 ...
 $ TimeDelta     : num  25.833 2.833 2.833 0.833 1.833 ...

When I execute the code, I get 1 obs. of 4 variables; there is no grouping applied.

'data.frame':   1 obs. of  4 variables:
 $ NumFeat : int 287071
 $ TotInc  : num NA
 $ TotDelta: num NA
 $ TotPaid : num 924636433

I used to do this all the time without problems.

I could use aggregate, but I sometimes mix and match functions by column, so it does not always work.

What am I doing wrong?

Bryan Butler
  • Could you have loaded [plyr before dplyr](https://stackoverflow.com/questions/26923862/why-are-my-dplyr-group-by-summarize-not-working-properly-name-collision-with/26933112#26933112)? – aosmith Oct 02 '18 at 22:20
  • @aosmith: you meant `plyr` "after" `dplyr` right? – Tung Oct 02 '18 at 23:34
  • @Tung Yes, "after" is what might cause the problem. :-D – aosmith Oct 03 '18 at 00:35
  • No, I am not using plyr at all; unless somehow it gets cached, but I frequently clean the global environment. – Bryan Butler Oct 03 '18 at 02:02
  • What does `sessionInfo()` show? When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. A `str()` isn't as helpful as a `dput()` for testing. – MrFlick Oct 03 '18 at 02:39
  • Cleaning the global environment doesn't unload any packages. – Gregor Thomas Oct 03 '18 at 13:22

1 Answer

After some research and experimentation, it turns out that the order in which the libraries are loaded matters. The original order was the following:

library(RODBC)
library(dplyr)
library(DT)
library(reshape2)
library(ggplot2)
library(scales)
library(caret)
library(markovchain)
library(knitr)
library(Metrics)
library(RColorBrewer)

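However, ggplot2 loads plyr as a dependency. Because that happens after dplyr has been attached, plyr's functions mask the dplyr functions of the same name, summarise among them. To avoid this, the order should be revised so that dplyr is loaded last, which is what I used to do: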

library(RODBC)
library(DT)
library(reshape2)
library(ggplot2)
library(scales)
library(caret)
library(markovchain)
library(knitr)
library(Metrics)
library(RColorBrewer)
library(dplyr)
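
To confirm that masking is actually the cause before reordering anything, the search path can be inspected directly. The following is a minimal sketch using only base R helpers (`find()`, `search()`, `environment()`, `detach()`); it assumes dplyr is installed, and the variable names `hits` and `owner` are just illustrative.

```r
library(dplyr)

# Every attached package that exports `summarise`, in search-path order.
# The first entry is the one an unqualified summarise() call will use.
hits <- find("summarise")
print(hits)

# Clearing the global environment (rm(list = ls())) does not detach packages,
# so plyr can still be attached from earlier in the session.
if ("package:plyr" %in% search()) {
  detach("package:plyr")
}

# Namespace that an unqualified summarise() call resolves to now.
owner <- environmentName(environment(summarise))
print(owner)
```

If `hits` lists `package:plyr` ahead of `package:dplyr`, the unqualified call was going to plyr. Note that `detach()` may refuse to remove plyr when another attached package declares it as a hard dependency, in which case reordering the `library()` calls or qualifying the calls remains the safer route.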

Alternatively, as in Python, you can state explicitly which library a command comes from. In Python, libraries are imported with the following syntax:

import numpy as np

Any numpy command is then referenced through the np. prefix, for example np.array(). The R equivalent is the package::function syntax.

Adding dplyr:: to the commands fixes the problem, as shown below.

dataAgg <- dataRed %>% dplyr::group_by(ClmNbr, SnapshotDay, Pre2016) %>%
  dplyr::filter(SnapshotDay == '30'| SnapshotDay == '90') %>%
  dplyr::summarise(
    NumFeat = sum(FeatureNbr),
    TotInc = sum(IncSnapshotDay),
    TotDelta = sum(InctoFinal),
    TotPaid = sum(FinalPaid)
  )
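
The effect of the prefix can be checked on a small example. This is a sketch on made-up data (the `toy` data frame is hypothetical and only borrows column names from the question), assuming dplyr is installed; the plyr comparison runs only if plyr happens to be installed as well.

```r
library(dplyr)

# Hypothetical toy data; column names borrowed from the question.
toy <- data.frame(
  ClmNbr      = c("A", "A", "B", "B"),
  SnapshotDay = factor(c("7", "30", "30", "90"), levels = c("7", "30", "90")),
  FeatureNbr  = c(1L, 2L, 3L, 4L)
)

# Filter first, then group; %in% replaces the chained == comparisons.
grouped <- toy %>%
  dplyr::filter(SnapshotDay %in% c("30", "90")) %>%
  dplyr::group_by(ClmNbr, SnapshotDay)

# Qualified call: one row per (ClmNbr, SnapshotDay) pair left after filtering.
by_group <- dplyr::summarise(grouped, NumFeat = sum(FeatureNbr))
print(by_group)

# plyr's summarise is not group-aware, so it is expected to collapse
# everything into a single row, which matches the symptom in the question.
if (requireNamespace("plyr", quietly = TRUE)) {
  collapsed <- plyr::summarise(grouped, NumFeat = sum(FeatureNbr))
  print(collapsed)
}
```

Qualifying every verb is more typing than fixing the load order, but it keeps the pipeline correct no matter which packages other code attaches later.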
Bryan Butler
  • If you update to a more current version of `ggplot`, it won't attach `plyr` to your search path, it will just use it internally. – Gregor Thomas Oct 03 '18 at 13:29
  • That was [changed in 2015](https://github.com/tidyverse/ggplot2/commit/4a8b43531a4dd2c1c75b3f75629d494e14867ae5#diff-82aec99efe7f43b9727953d81800795c) ... I wonder what else could use upgrading? – r2evans Oct 03 '18 at 13:35
  • Thanks for the information. However, I did not set up the environment, and thought it was up to date given that it is R version 3.5.1. It seems like it might make sense to merge a few of these packages (dplyr, plyr, ddply) to eliminate this issue. – Bryan Butler Oct 03 '18 at 13:47