How to detect IDs with inconsistent (multivalued) observations in a dataset?

Question

I have a dataset that contains three variables, like this:

id   gender phase
a1     m      1
a1     m      2
a1     m      3
b2     m      1
b2     f      2
b2     m      3
c3     f      1
c3     f      2
c3     f      3
...

Notice that for id == b2, phase == 2, the gender is accidentally recorded as "f". It should be "m", consistent with the other phases, because gender cannot change during the study. How can I detect in R which ids have this problem? Thanks a lot!

If you use dput(my_data_frame), it will export a structure that people can use to reproduce your data, which makes it easier to help you. Try dput(mtcars) to see how it works, then copy the structure from your console here so people can work with your data. — Dimitrios Zacharatos, Jul 13 '22 at 14:58
@Dimitrios Zacharatos The whole dataset has over 160,000 observations, so I'm not sure I can paste it here. But the example dataset I listed above has the same structure. — Rstudyer, Jul 13 '22 at 15:05

Darren Tsai · Accepted Answer · 2022-07-13T15:13:48.193

score 3

With dplyr, you can detect which ids have more than one gender value with n_distinct(), which counts the unique values within each group.

library(dplyr)

df %>%
  group_by(id) %>%
  filter(n_distinct(gender) > 1) %>%
  ungroup()

# # A tibble: 3 × 3
#   id    gender phase
#   <chr> <chr>  <int>
# 1 b2    m          1
# 2 b2    f          2
# 3 b2    m          3
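If only the list of affected ids is needed rather than all of their rows, the same n_distinct() idea can be combined with summarise(). This is a minimal sketch on the example data from the question; the names bad_ids and n_gender are chosen here for illustration only.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id     = rep(c("a1", "b2", "c3"), each = 3),
  gender = c("m", "m", "m", "m", "f", "m", "f", "f", "f"),
  phase  = rep(1:3, times = 3)
)

# One row per id, keeping only ids whose gender takes more than one value
bad_ids <- df %>%
  group_by(id) %>%
  summarise(n_gender = n_distinct(gender), .groups = "drop") %>%
  filter(n_gender > 1)

bad_ids$id
```

On a dataset with many rows per id, this gives a short table to review instead of every offending row.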

edited Jul 13 '22 at 15:13

answered Jul 13 '22 at 15:06


score 2 · Answer 2 · answered Jul 13 '22 at 15:09

You can use lag() to check whether the value changes from one row to the next within each id, and keep the ids that have a change, like this:

df <- read.table(text="id   gender phase
a1     m      1
a1     m      2
a1     m      3
b2     m      1
b2     f      2
b2     m      3
c3     f      1
c3     f      2
c3     f      3", header = TRUE)

library(dplyr)
df %>%
  group_by(id) %>%
  filter(any(gender != lag(gender), na.rm = TRUE))
#> # A tibble: 3 × 3
#> # Groups:   id [1]
#>   id    gender phase
#>   <chr> <chr>  <int>
#> 1 b2    m          1
#> 2 b2    f          2
#> 3 b2    m          3

Created on 2022-07-13 by the reprex package (v2.0.1)
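The lag() comparison can also be used to show where within an id the value changes, not just which ids are affected. The sketch below assumes the rows are already ordered by phase within each id; the column name changed and the object changes are illustrative only.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id     = rep(c("a1", "b2", "c3"), each = 3),
  gender = c("m", "m", "m", "m", "f", "m", "f", "f", "f"),
  phase  = rep(1:3, times = 3)
)

# Flag rows whose gender differs from the previous row of the same id.
# The first row of each id has no predecessor (lag() returns NA),
# so it is never flagged.
changes <- df %>%
  group_by(id) %>%
  mutate(changed = !is.na(lag(gender)) & gender != lag(gender)) %>%
  ungroup() %>%
  filter(changed)

changes
```

Each flagged row marks a switch relative to the previous phase, so a single wrong entry in the middle of an id typically produces two flagged rows: one switching away from the usual value and one switching back.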

score 1 · Answer 3 · answered Jul 13 '22 at 15:05

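# Rebuild the example data from the question
id <- c("a1", "a1", "a1", "b2", "b2", "b2", "c3", "c3", "c3")
gender <- c("m", "m", "m", "m", "f", "m", "f", "f", "f")
phase <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
mydata <- data.frame(id, gender, phase)

# Overwrite gender with the known correct value for each id
mydata[mydata$id %in% c("a1", "b2"), "gender"] <- "m"
mydata[mydata$id %in% c("c3"), "gender"] <- "f"
mydata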
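Hard-coding the correct value per id does not scale to many ids. One possible generalisation, assuming the most frequent value within an id is the correct one (an assumption that is not stated in the question), is to overwrite gender with the per-id majority value. The helper most_common() is hypothetical and written here for illustration; it assumes every id has at least one non-missing gender.

```r
library(dplyr)

# Example data from the question, including the inconsistent row
mydata <- data.frame(
  id     = rep(c("a1", "b2", "c3"), each = 3),
  gender = c("m", "m", "m", "m", "f", "m", "f", "f", "f"),
  phase  = rep(1:3, times = 3)
)

# Hypothetical helper: the most frequent value in x.
# Ties are broken in favour of the value that sorts first.
most_common <- function(x) names(which.max(table(x)))

# Replace gender with the majority value within each id
fixed <- mydata %>%
  group_by(id) %>%
  mutate(gender = most_common(gender)) %>%
  ungroup()

fixed
```

It is worth reviewing the ids flagged as inconsistent before applying an automatic fix like this, since the majority value is not guaranteed to be the true one.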
