How can I group variables in when dplyr and base R functions don't work?

Question

I need to group my data for province IDs (MUN_RESID) and population (V16). My dataframe contains 8,627,071 observations. I've been trying solutions provided in this forum for days now such as this and this, but nothing works. Any help on this would be greatly appreciated. Thank you very much

This is what the sample looks like:

          X MUN_RESID   V16 X08.2005_P  X09.2005_P X10.2005_P
1             1    110001 13203          0 0.007574036          0
2             2    110001 13203          0 0.007574036          0
3             3    110001 13203          0 0.007574036          0
4             4    110001 13203          0 0.007574036          0
5             5    110001 13203          0 0.007574036          0
6             6    110001 13203          0 0.007574036          0
7             7    110001 13203          0 0.007574036          0
8627069 8627069    530010 14802          0 0.000000000          0
8627070 8627070    530010 14802          0 0.000000000          0
8627071 8627071    530010 14802          0 0.000000000          0


==X==============================================================X==
     Copy+Paste this part. (If on a Mac, it is already copied!)
==X==============================================================X==

 months0606 <- structure(list(X = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8627069L, 8627070L,8627071L),
          MUN_RESID = c(110001L, 110001L, 110001L,
          110001L,110001L, 110001L, 110001L, 530010L, 530010L, 530010L),
          V16 = c(13203L,13203L, 13203L, 13203L, 13203L, 13203L, 13203L, 14802L, 14802L,14802L),
          X08.2005_P = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
          X09.2005_P = c(0.00757403620389305,0.00757403620389305,
          0.00757403620389305, 0.00757403620389305,0.00757403620389305,
          0.00757403620389305, 0.00757403620389305,0, 0, 0),
          X10.2005_P = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
          0)), class = "data.frame", row.names =
          c(1L,2L, 3L, 4L, 5L, 6L, 7L, 8627069L, 8627070L, 8627071L))





==X==============================================================X==

I've tried

months0606_grouped <- ddply(months0606, .(V16))

(does not give me any output at all)

library(dplyr)
months0606 %>% group_by(MUN_RESID, V16)
months0606 %>% dplyr::group_by(MUN_RESID)

(does not give me any error warning, but does no grouping either. This is the output:

# A tibble: 8,627,071 x 20
# Groups:   MUN_RESID [5,227]
       X MUN_RESID   V16 X08.2005_P X09.2005_P X10.2005_P
   <int>     <int> <int>      <dbl>      <dbl>      <dbl>
 1     1    110001 13203          0    0.00757          0
 2     2    110001 13203          0    0.00757          0
 3     3    110001 13203          0    0.00757          0
 4     4    110001 13203          0    0.00757          0
 5     5    110001 13203          0    0.00757          0
 6     6    110001 13203          0    0.00757          0
 7     7    110001 13203          0    0.00757          0
 8     8    110001 13203          0    0.00757          0
 9     9    110001 13203          0    0.00757          0
 10    10    110001 13203          0    0.00757          0
# ... with 8,627,061 more rows, and 14 more variables:
#   X11.2005_P <dbl>, X12.2005_P <dbl>,
#   X01.2006_P <dbl>, X02.2006_P <dbl>,
#   X03.2006_P <dbl>, X04.2006_P <dbl>,
#   X05.2006_P <dbl>, X06.2006_P <dbl>,
#   X07.2006_P <dbl>, X08.2006_P <dbl>,
#   X09.2006_P <dbl>, X10.2006_P <dbl>,
#   X11.2006_P <dbl>, X12.2006_P <dbl>

Also tried:

months0606$V16 <- with(months0606, ifelse(V16 %in% months0606, "V16"))

My goal is to have my dataframe look like this: Every combination of MUN_RESID and population level(V16) only contains one row:

MUN_RESID   V16     X08.2005_P  X09.2005_P 
110001      13203   0           0.007507
530010      530010  0           0

Try `newdf<-months0606 %>% select(-X) %>% dplyr::group_by(MUN_RESID) %>% distinct() %>% ungroup()` or `newdf<-months0606 %>% dplyr::select(-X) %>% dplyr::group_by(MUN_RESID) %>% slice(1) %>% ungroup()` — A. Suliman, May 05 '19 at 17:38
The group_by worked - see output and how there are 5227 groups. Pipe to ```summarize()``` and choose a function like ```sum()```. If you need all columns summarized, use ```summarize_all()```. — Cole, May 05 '19 at 17:40

MatthewR · Accepted Answer · 2019-05-05T18:04:52.427

0

If you are just looking for the unique rows of MUN_RESID and V16 - you can use the duplicated function

months0606[ !duplicated( months0606[ , c( "MUN_RESID","V16")]) , ]

since you are dealing with a large data set you could consider data.table but you need to decide what operations you are doing by your groups. I took the means, in your example it matches the duplicated function, but wouldn't if there were differences in either of the X08 vars

library( data.table )
months0606 <- data.table( months0606 )
months0606[ , .( 
    X08.2005_P=mean(X08.2005_P),
    X09.2005_P=mean( X09.2005_P)
    ),
    by=c("MUN_RESID" ,  "V16" )]

edited May 05 '19 at 18:04

answered May 05 '19 at 17:56

MatthewR

2,660
5
26
37

Thank you very much Matthew, Cole and A. Suliman. Your suggested codes worked well and are such a relief. This was my first question on SO and I'm so happy to have found such a great community helping out with problems. Won't be the last question I asked here, I'm afraid... Thanks! – Lisa Mauch May 07 '19 at 09:27

How can I group variables in when dplyr and base R functions don't work?

1 Answers1