Filtering data based on race

Question

I would appreciate any help to solve a small problem. In the table below, as you can see, each ID has a GFR value for both White and Black race. Based on the race column, I need to only pick the GFR value that matches the race of the ID. For example, ID 1088 is of "White" Race, hence, I need to choose the "GFR Caucas" value and drop the "GFR Black" value.

dat <- data.frame(
  ACCT = c("1088","1088","1088","1088","44867","44867","44867","1564","1564"),
  TEST_DATE = c("4/4/2016","4/4/2016","4/4/2016","4/4/2016","4/4/2016","4/4/2016","4/4/2016","4/4/2016","4/4/2016"),
  TEST_CODE = c("GFR Caucas", "GFR Caucas", "GFR Black","GFR Black","GFR", "GFR Caucas", "GFR Black", "GFR Caucas", "GFR Black"),
  Value = c("58", "58", "47", "47", "45", "33", "27", "30", "30"),
  RACE = c("White","White","White","White","Black", "Black", "Black", "White", "White")
)

Please provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) including input data and expected output data in a format that others are able to reproduce in their own R sessions (so, no images of tables) — IceCreamToucan, Jan 29 '21 at 17:43

r2evans · Answer 1 · 2021-01-29T18:56:00.150

Try this base R work. You only "need" the first and last lines of this block, the rest are "building".

ptns <- data.frame(RACE=c("White", "Black"), ptn = c("(white|caucas)", "black"))
sapply(ptns$ptn, grepl, x = dat$TEST_CODE, ignore.case = TRUE)
#       (white|caucas) black
#  [1,]           TRUE FALSE
#  [2,]           TRUE FALSE
#  [3,]          FALSE  TRUE
#  [4,]          FALSE  TRUE
#  [5,]          FALSE FALSE
#  [6,]           TRUE FALSE
#  [7,]          FALSE  TRUE
#  [8,]           TRUE FALSE
#  [9,]          FALSE  TRUE

apply(sapply(ptns$ptn, grepl, x = dat$TEST_CODE, ignore.case = TRUE), 1, which.max)
# [1] 1 1 2 2 1 1 2 1 2
ptns$RACE[apply(sapply(ptns$ptn, grepl, x = dat$TEST_CODE, ignore.case = TRUE), 1, which.max)]
# [1] "White" "White" "Black" "Black" "White" "White" "Black" "White" "Black"
dat$RACE == ptns$RACE[apply(sapply(ptns$ptn, grepl, x = dat$TEST_CODE, ignore.case = TRUE), 1, which.max)]
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
dat[dat$RACE == ptns$RACE[apply(sapply(ptns$ptn, grepl, x = dat$TEST_CODE, ignore.case = TRUE), 1, which.max)],]
#    ACCT TEST_DATE  TEST_CODE Value  RACE
# 1  1088  4/4/2016 GFR Caucas    58 White
# 2  1088  4/4/2016 GFR Caucas    58 White
# 7 44867  4/4/2016  GFR Black    27 Black
# 8  1564  4/4/2016 GFR Caucas    30 White

An alternative can be had using the fuzzyjoin package that joins based on regexes (and other methods):

library(fuzzyjoin)
# dat and ptns from above
regex_left_join(dat, ptns, by = c("TEST_CODE" = "ptn"), ignore_case = TRUE)
#    ACCT TEST_DATE  TEST_CODE Value RACE.x RACE.y            ptn
# 1  1088  4/4/2016 GFR Caucas    58  White  White (white|caucas)
# 2  1088  4/4/2016 GFR Caucas    58  White  White (white|caucas)
# 3  1088  4/4/2016  GFR Black    47  White  Black          black
# 4  1088  4/4/2016  GFR Black    47  White  Black          black
# 5 44867  4/4/2016        GFR    45  Black   <NA>           <NA>
# 6 44867  4/4/2016 GFR Caucas    33  Black  White (white|caucas)
# 7 44867  4/4/2016  GFR Black    27  Black  Black          black
# 8  1564  4/4/2016 GFR Caucas    30  White  White (white|caucas)
# 9  1564  4/4/2016  GFR Black    30  White  Black          black

subset(regex_left_join(dat, ptns, by = c("TEST_CODE" = "ptn"), ignore_case = TRUE),
       subset = RACE.x == RACE.y)
#    ACCT TEST_DATE  TEST_CODE Value RACE.x RACE.y            ptn
# 1  1088  4/4/2016 GFR Caucas    58  White  White (white|caucas)
# 2  1088  4/4/2016 GFR Caucas    58  White  White (white|caucas)
# 7 44867  4/4/2016  GFR Black    27  Black  Black          black
# 8  1564  4/4/2016 GFR Caucas    30  White  White (white|caucas)

Thank you for your help! I also wanted to note that I have a couple of different test notations for black and white. Following are the test names that I have to consider while picking the result: Black<- c("eGFR-AfA", "GFR Black") White <- c("CMP W/GFR","eGFR-Oth","GFR Caucas", "GFR") — Meghna, Jan 29 '21 at 19:07
That's the reason I started with the assumption that a regex would work well. For instance, change `"black"` to `"(afa$|black$)"` would work, and white to `"(cmp w|-oth|caucas)"` or similar. I don't know your data, so I would tend to be case-insensitive and use `ignore_case=TRUE`. Also, short patterns like `"afa"` have the risk of over-matching (not knowing all possible values), so being as specific as possible is helpful. For instance, if it is always at the end, then `afa$` matches that at the end of a string. — r2evans, Jan 29 '21 at 19:16
If you know a priori all possible values, though, you can forego regex and create `ptns <- data.frame(codes = c("eGFR-AfA", "GFR Black", "CMP W/GFR", ...), RACE = c("Black", "Black", "White", ...))`, and then use (non-`fuzzyjoin`) `merge` or `dplyr::left_join`. — r2evans, Jan 29 '21 at 19:18
When I execute your first code, I only get matches for the "White" race for some reason — Meghna, Jan 29 '21 at 19:32

Filtering data based on race

1 Answers1