How to extract subgroups in R

Question

Suppose this dataset:

    date      exdate         flag       V1            V2
 1996-01-04  1996-01-20       P       400000      -0.001181
 1996-01-04  1996-01-20       C       400000      -0.004897
 .............
 1996-01-04  1996-01-20       P       530000      -0.005147
 .............
 1996-01-04  1996-01-20       P       535000      -0.005423
 .............
 1996-01-04  1996-01-20       C       545000      -0.007922
 1996-01-04  1996-01-20       P       545000      -0.008389

I have ordered the data in the desired format, and I want to only extract the rows that are in the same group of date, exdate, and V1, but remove the groups with a single observation. Note that there are groups that contain only one observation, and not both C and P (in the flag variable) as in the first and last group.

It is possible through a package in R like dplyr::filter?

I have tried this:

data %>% group_by(date,exdate,V1) %>% filter(V1[flag=="P"]==V1[flag=="C"])

with a returning error:

Error in filter_impl(.data, dots) : incorrect length (0), expecting: 1

The goal is to get a dataset that is like this:

    date      exdate         flag       V1            V2
 1996-01-04  1996-01-20       P       400000      -0.001181
 1996-01-04  1996-01-20       C       400000      -0.004897
 .............
 1996-01-04  1996-01-20       C       545000      -0.007922
 1996-01-04  1996-01-20       P       545000      -0.008389

Do you have same number of elements that are 'P' and 'C' for each group — akrun, May 15 '17 at 17:16
V1[FALSE] will have a length of zero, explaining that error... — Frank, May 15 '17 at 17:17
Did you mean to write: `filter( length(V1[flag=="P"]) == length(V1[flag=="C"]) )` ? — IRTFM, May 15 '17 at 17:19
Possibly relevant: [Count number of observations/rows per group and add result to data frame](http://stackoverflow.com/questions/7450600/count-number-of-observations-rows-per-group-and-add-result-to-data-frame) and [Subset data frame based on number of rows per group](http://stackoverflow.com/questions/20204257/subset-data-frame-based-on-number-of-rows-per-group) — Henrik, May 15 '17 at 17:24
@akrun The p and c elements are not of equal number. @42 I hadn't thought of using the `length` in the `filter`. Will give it a shot! — Hercules Apergis, May 15 '17 at 18:10
@Frank It could explain it...yeah, so basically whenever that happens, that group will have either C or P, but not both! Thus that observation must be removed. — Hercules Apergis, May 15 '17 at 18:18

score 1 · Accepted Answer · answered May 15 '17 at 17:28

1

data <- read.table(header = T, text = '
date      exdate         flag       V1            V2
1996-01-04  1996-01-20       P       400000      -0.001181
1996-01-04  1996-01-20       C       400000      -0.004897
1996-01-04  1996-01-20       P       530000      -0.005147
1996-01-04  1996-01-20       P       535000      -0.005423
1996-01-04  1996-01-20       C       545000      -0.007922
1996-01-04  1996-01-20       P       545000      -0.008389
')

library(dplyr)
data %>% 
  group_by(date, exdate, V1) %>% 
  filter(n() == 2)

answered May 15 '17 at 17:28

CJ Yetman

8,373
2
24
56

Could you please explain what the `n() ==2` do?! – Hercules Apergis May 15 '17 at 18:12
`n()` is a `dplyr` function that gives the number of rows... in this case, the number of rows in each group. So `n() == 2` filters only lines that are in groups of 2 – CJ Yetman May 15 '17 at 18:57

How to extract subgroups in R

1 Answers1