0

We're analyzing columns from our SQL Server environment. We're pulling column names and data types. Then we're running a simple pipe argument to see if we have mixed data types for the same column names across disparate tables.

library(tidyverse)
DF = data.frame(COLUMN_NAME = c("PARTYID","PARTYID","AGE","AGE","SALESID","SALES"), 
                DATA_TYPE = c("char","tinyint","int","smallint","varchar","numeric"))
DF %>% group_by(COLUMN_NAME) %>% 
           summarise(mixedTypes = (any(grepl("char", DATA_TYPE)) & 
                                  !(all(grepl("char", DATA_TYPE)))))

All I'm getting back is

  mixedTypes
1       TRUE

But I believe I should be getting back a subset of the data.frame, including both columns with a new column called mixedTypes.

UPDATE: Someone suggested using conflicts and I'm not educated enough to understand how to interpret the detail=TRUE output:

$.GlobalEnv
[1] "df"

$`package:forcats`
[1] "%>%" "%>%" "%>%" "%>%" "%>%"

$`package:purrr`
[1] "%>%"       "%>%"       "compact"   "%>%"       "%>%"       "set_names" "%>%"      

$`package:tidyr`
[1] "%>%"     "%>%"     "%>%"     "%>%"     "extract" "%>%"    

$`package:plyr`
 [1] "compact"     "arrange"     "count"       "desc"        "failwith"    "id"          "mutate"      "rename"      "summarise"  
[10] "summarize"   "is.discrete" "summarize"  

$`package:stringr`
[1] "%>%" "%>%" "%>%" "%>%" "%>%"

$`package:tibble`
 [1] "add_row"       "as_data_frame" "as_tibble"     "data_frame"    "data_frame_"   "frame_data"    "glimpse"       "lst"          
 [9] "lst_"          "tbl_sum"       "tibble"        "tribble"       "trunc_mat"     "type_sum"     

$`package:magrittr`
[1] "%>%"       "%>%"       "%>%"       "%>%"       "extract"   "set_names" "%>%"      

$`package:dplyr`
 [1] "%>%"           "%>%"           "%>%"           "%>%"           "%>%"           "add_row"       "arrange"       "as_data_frame"
 [9] "as_tibble"     "count"         "data_frame"    "data_frame_"   "desc"          "failwith"      "frame_data"    "glimpse"      
[17] "id"            "lst"           "lst_"          "mutate"        "rename"        "summarise"     "summarize"     "tbl_sum"      
[25] "tibble"        "tribble"       "trunc_mat"     "type_sum"      "src"           "summarize"     "coalesce"      "filter"       
[33] "lag"           "intersect"     "setdiff"       "setequal"      "union"        

$`package:Hmisc`
[1] "summarize"   "is.discrete" "src"         "summarize"   "format.pval" "units"      

$`package:ggplot2`
[1] "Position"

$`package:MyPackage`
[1] "coalesce" "HeatMap" 

$`package:stats`
[1] "df"     "filter" "lag"   

$`package:methods`
[1] "body<-"    "kronecker"

$`package:base`
 [1] "body<-"      "format.pval" "HeatMap"     "intersect"   "kronecker"   "Position"    "setdiff"     "setequal"    "union"      
[10] "units"   
quickreaction
  • 675
  • 5
  • 17
  • What version of dplyr are you using? – Robin Gertenbach Jul 02 '18 at 20:02
  • 2
    If i copy/paste the code above, that is not the result i'm getting. Have you tried this in a fresh R session? Anything suspicious in your `conflicts()`? – MrFlick Jul 02 '18 at 20:02
  • You're right. Fresh R session and it was fine. I don't know how to interpret the `conflicts()` output... there are multiple pipes (`%>%`) in `forcats`, `purrr`, `tidyr`, `stringr`, `magrittr`, and `dplyr` – quickreaction Jul 02 '18 at 20:12
  • Looks like `dplyr` 0.7.4 – quickreaction Jul 02 '18 at 20:16
  • 3
    Looks you have both `package:plyr` and `package:dplyr` loaded which is usually not a good idea. But if you do need both of them, make sure to load `plyr` before `dplyr`. – MrFlick Jul 02 '18 at 20:34
  • 1
    Did you load **plyr** before **dplyr**? If so, [this question](https://stackoverflow.com/questions/26923862/why-are-my-dplyr-group-by-summarize-not-working-properly-name-collision-with) is a possible duplicate. – aosmith Jul 02 '18 at 20:52
  • plyr was being loaded in a function that was being called earlier in the script. It seems this call superseded the dplyr function. Flag this as a duplicate if you believe it so... – quickreaction Jul 03 '18 at 13:13

1 Answers1

1

As said in the comments, the problem is that the plyr version of summarise is loaded after dplyr so when you call summarise you are getting the wrong one. You should try to load plyr first (or much better, try not to load it at all), but you can also play safe by being explicit which version of summarise you want.

library(tidyverse)
DF = data.frame(COLUMN_NAME = c("PARTYID","PARTYID","AGE","AGE","SALESID","SALES"), 
                DATA_TYPE = c("char","tinyint","int","smallint","varchar","numeric"))

# bad:
DF %>% group_by(COLUMN_NAME) %>% 
  plyr::summarise(mixedTypes = (any(grepl("char", DATA_TYPE)) & 
                            !(all(grepl("char", DATA_TYPE)))))

# good:
DF %>% group_by(COLUMN_NAME) %>% 
  dplyr::summarise(mixedTypes = (any(grepl("char", DATA_TYPE)) & 
                            !(all(grepl("char", DATA_TYPE)))))

If you really need plyr loaded as well as dplyr it would be a good idea to do it this way, and also with other key conflicts like mutate. But better is to avoid having both loaded at once.

Peter Ellis
  • 5,694
  • 30
  • 46