I have a dataset that contains 3 variables, like this:

id   gender phase
a1     m      1
a1     m      2
a1     m      3
b2     m      1
b2     f      2
b2     m      3
c3     f      1
c3     f      2
c3     f      3
...

Notice that for id == "b2", phase == 2, the gender is accidentally marked as "f". It should be "m", consistent with the other phases, because gender cannot change during the study phases. If I want to run R code to detect which ids have this issue, how should I do that? Thanks a lot!

Rstudyer
  • If you use dput(my_data_frame), it will export a structure that people can use to reproduce your data, which makes it easier to help you. Try dput(mtcars) to see how it works, then copy the structure from your console here so people can work with your data. – Dimitrios Zacharatos Jul 13 '22 at 14:58
  • @Dimitrios Zacharatos The whole dataset has over 160,000 observations, so I'm not sure I can paste it here. But the example dataset I listed above has the same structure. – Rstudyer Jul 13 '22 at 15:05

3 Answers

With dplyr, you can detect which ids have more than one gender by using n_distinct() within each group.

library(dplyr)

df %>%
  group_by(id) %>%
  filter(n_distinct(gender) > 1) %>%
  ungroup()

# # A tibble: 3 × 3
#   id    gender phase
#   <chr> <chr>  <int>
# 1 b2    m          1
# 2 b2    f          2
# 3 b2    m          3
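
If dplyr is not available, the same check can be sketched in base R. This is only an illustration, not part of the answer above: it rebuilds the example data from the question, and the names n_gender and bad_ids are made up for this sketch.

```r
# Example data from the question
df <- data.frame(
  id     = rep(c("a1", "b2", "c3"), each = 3),
  gender = c("m", "m", "m", "m", "f", "m", "f", "f", "f"),
  phase  = rep(1:3, times = 3)
)

# Number of distinct gender values recorded for each id
n_gender <- tapply(df$gender, df$id, function(g) length(unique(g)))

# ids with more than one distinct gender value
bad_ids <- names(n_gender)[n_gender > 1]
bad_ids
# [1] "b2"

# All rows that belong to those ids
df[df$id %in% bad_ids, ]
```

With 160,000+ rows, printing only bad_ids first is probably more practical than printing every affected row.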
Darren Tsai

You can use lag() to check whether the value in the column changes within each id, and then keep the ids that have a change, like this:

df <- read.table(text="id   gender phase
a1     m      1
a1     m      2
a1     m      3
b2     m      1
b2     f      2
b2     m      3
c3     f      1
c3     f      2
c3     f      3", header = TRUE)

library(dplyr)
df %>%
  group_by(id) %>%
  filter(any(gender != lag(gender)))
#> # A tibble: 3 × 3
#> # Groups:   id [1]
#>   id    gender phase
#>   <chr> <chr>  <int>
#> 1 b2    m          1
#> 2 b2    f          2
#> 3 b2    m          3

Created on 2022-07-13 by the reprex package (v2.0.1)
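
One detail of the lag() approach is worth spelling out: the first row of each group has no previous value, so the comparison yields NA there. The sketch below mimics this in base R to show why consistent ids are still dropped (filter() discards rows where the condition is NA) and how na.rm = TRUE turns the result into an explicit FALSE. The helper prev() is hypothetical and only stands in for dplyr::lag() with its default settings.

```r
# Hypothetical stand-in for dplyr::lag(): shift a vector down by one position
prev <- function(x) c(NA, head(x, -1))

consistent   <- c("m", "m", "m")
inconsistent <- c("m", "f", "m")

# No change anywhere, but the first comparison is NA, so any() returns NA
any(consistent != prev(consistent))
# [1] NA

# At least one change, so any() returns TRUE despite the leading NA
any(inconsistent != prev(inconsistent))
# [1] TRUE

# Dropping the NA makes the "no change" case an explicit FALSE
any(consistent != prev(consistent), na.rm = TRUE)
# [1] FALSE
```

In the dplyr pipeline, this would correspond to writing any(gender != lag(gender), na.rm = TRUE) inside filter(); the result should be the same, just more explicit.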

Quinten
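# Rebuild the example data from the question
id <- c("a1", "a1", "a1", "b2", "b2", "b2", "c3", "c3", "c3")
gender <- c("m", "m", "m", "m", "f", "m", "f", "f", "f")
phase <- c(1, 2, 3, 1, 2, 3, 1, 2, 3)
mydata <- data.frame(id, gender, phase)
# Overwrite gender by id so that every phase of an id has the same value
mydata[mydata$id %in% c("a1", "b2"), "gender"] <- "m"
mydata[mydata$id %in% c("c3"), "gender"] <- "f"
mydata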
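
Assigning the correct value id by id may not scale to 160,000+ observations. A possible alternative, sketched below, is to take the most frequent gender within each id as the corrected value. This rests on an assumption the question does not confirm: that the majority value within an id is the right one. The helper most_common() and the column gender_fixed are hypothetical names used only for this illustration.

```r
# Example data from the question
mydata <- data.frame(
  id     = rep(c("a1", "b2", "c3"), each = 3),
  gender = c("m", "m", "m", "m", "f", "m", "f", "f", "f"),
  phase  = rep(1:3, times = 3)
)

# Hypothetical helper: most frequent value in a vector
# (on a tie, the value that sorts first wins)
most_common <- function(x) names(which.max(table(x)))

# ASSUMPTION: the majority value within each id is the correct one
majority <- vapply(split(mydata$gender, mydata$id), most_common, character(1))

# Look up the majority value for every row, keeping the original column intact
mydata$gender_fixed <- unname(majority[as.character(mydata$id)])

# Rows whose recorded gender differs from the majority value
mydata[mydata$gender != mydata$gender_fixed, ]
```

Keeping the result in a separate column makes it easy to review the flagged rows before overwriting anything. An id with a tie (for example, one "m" and one "f") cannot be resolved this way and would need a manual check.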