Subsetting/filtering with dplyr - improving my syntax

Question

I am new to dplyr and trying to improve my syntax. I have the following data frame:

testdf5<- data.frame(
  stringsAsFactors = FALSE,
  col1=c('aa', 'aa', 'aa', 'bb', 'bb', 'bb', 'cc','cc','cc'),
  MyLength=c('500', '500', '600', '500', '600', '600', '700','700','600'),
  col3=c('0.5', '0.5', '0.5', '0.5', '0.5', '0.5', '0.5','0.7','0.7'),
  POS=c(
    500, 1000, 2000,
    400, 500, 600,
    10000, 10500, 11000))

I want to:

1) group the rows by col1, Mylength and col3;

2) for each group, I want the minimum and the maximum POS

This is the result I want:

col1 MyLength col3 MinPos    MaxPOS  
aa      500   0.5    500     1000  
aa      600   0.5   2000     2000  
bb      500   0.5    400      400  
bb      600   0.5    500      600  
cc      600   0.7  11000     11000 
cc      700   0.5  10000     10000 
cc      700   0.7  10500     10500

This is my code, which works:

testdf6<- testdf5 %>%
  #needs '.dots' to read a character vector
  dplyr::group_by(.dots=c('col1', 'MyLength', 'col3')) %>%
  dplyr::filter(POS==min(POS)) ##get min(POS)
colnames(testdf6)[4] <- 'MinPos'

testdf7<- testdf5 %>%
  #needs '.dots' to read a character vector
  dplyr::group_by(.dots=c('col1', 'MyLength', 'col3')) %>%
  dplyr::filter(POS==max(POS)) ##Get max(POS)
# 
 colnames(testdf7)[4] <- 'MaxPos'
#Now merge
testdf8<- merge(testdf6, testdf7, by = c('col1',  'MyLength', 'col3'))

I am basically doing the same operation twice, and I was wondering if there is a cleaner way, as I am trying to improve my syntax. I look forward to your feedback.

`testdf5 %>% group_by(col1, MyLength, col3) %>% summarise(MinPOS = min(POS), MaxPOS = max(POS))` — Ronak Shah, Sep 28 '18 at 18:38
Hey, Ronak, thank you. I noticed that you answered before @Callum (thanks to both). Care to make your comment an answer, so I can accept it, since you were first? — Max_IT, Sep 28 '18 at 18:46
Actually, this question was asked before and I have marked this as duplicate. You can accept Calum's answer. — Ronak Shah, Sep 28 '18 at 18:50

score 2 · Accepted Answer · answered Sep 28 '18 at 18:39

You should use summarise to compute statistics like this, and you can compute several different ones at once:

library(tidyverse)
testdf5<- data.frame(
  stringsAsFactors = FALSE,
  col1=c('aa', 'aa', 'aa', 'bb', 'bb', 'bb', 'cc','cc','cc'),
  MyLength=c('500', '500', '600', '500', '600', '600', '700','700','600'),
  col3=c('0.5', '0.5', '0.5', '0.5', '0.5', '0.5', '0.5','0.7','0.7'),
  POS=c(
    500, 1000, 2000,
    400, 500, 600,
    10000, 10500, 11000))

testdf5 %>%
  group_by(col1, MyLength, col3) %>%
  summarise(MinPos = min(POS), MaxPos = max(POS))
#> # A tibble: 7 x 5
#> # Groups:   col1, MyLength [?]
#>   col1  MyLength col3  MinPos MaxPos
#>   <chr> <chr>    <chr>  <dbl>  <dbl>
#> 1 aa    500      0.5      500   1000
#> 2 aa    600      0.5     2000   2000
#> 3 bb    500      0.5      400    400
#> 4 bb    600      0.5      500    600
#> 5 cc    600      0.7    11000  11000
#> 6 cc    700      0.5    10000  10000
#> 7 cc    700      0.7    10500  10500

Created on 2018-09-28 by the reprex package (v0.2.0).

Subsetting/filtering with dplyr - improving my syntax

1 Answers1