0

I am looking for an easy way to find whether the value in a column lies within the range of values in other columns.

My input looks like this:

ID  "Q1 Comm - 01 Scope Thesis" "Q1 Comm - 02 Scope Project" "Q1 Comm - 03 Learn Intern"    "Q1 Comm - 04 Biography"    "Q1 Comm - Overall Plan"
10   NA                          NA                           4                              NA      4
31   2                           NA                           NA                             NA      2
225  0                           NA                           NA                             NA      1
243  NA                          2                            NA                             1       0
310  NA                          2                            NA                             1       NA

For each unique ID, I am interested in identifying when the column Q1 Comm - Overall Plan is:

1 - Below the min() of all the other columns, or

2 - Above the max() of all the other columns, or

3 - Within the range of all the other columns

The complete list of columns (along with the overall column) is below:

"Q1 Comm - 01 Scope Thesis"
"Q1 Comm - 02 Scope Project"
"Q1 Comm - 03 Learn Intern"
"Q1 Comm - 04 Biography"
"Q1 Comm - 05 Exhibit"
"Q1 Comm - 06 Social Act"
"Q1 Comm - 07 Post Project"
"Q1 Comm - 08 Learn Plant"
"Q1 Comm - 09 Study Narrate"
"Q1 Comm - 10 Learn Participate"
"Q1 Comm - 11 Write 1"
"Q1 Comm - 12 Read 2"
"Q1 Comm - Overall Plan"

My required output is something like this:

ID  "Q1 Comm - 01 Scope Thesis" "Q1 Comm - 02 Scope Project" "Q1 Comm - 03 Learn Intern"    "Q1 Comm - 04 Biography"    "Q1 Comm - Overall Plan" "Q1_check"
10   NA                          NA                           4                              NA      4 "within"
31   2                           NA                           NA                             NA      2 "within"
225  0                           NA                           NA                             NA      1 "above"
243  NA                          2                            NA                             1       0 "below"
310  NA                          2                            NA                             1       NA NA

The dput() for my data frame df is below.

dput(df)

structure(list(ID = c(10L, 31L, 225L, 243L), Q1.Comm...01.Scope.Thesis = c(NA, 
2L, 0L, NA), Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L), 
    Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA), Q1.Comm...04.Biography = c(NA, 
    NA, NA, 1L), Q1.Comm...Overall.Plan = c(4L, 1L, 2L, 
    NA), X = c(NA, NA, NA, NA), X.1 = c(NA, NA, NA, NA), X.2 = c(NA, 
    NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))

Note:

I had asked this question here Finding if a value is within the range of other columns but the example was too simplified and none of the solutions worked for me.

The question was getting too lengthy, therefore, for the sake of clarity, I am posting this as a new question.

I thank you for your time and help on this post.

Sandy
  • 1,100
  • 10
  • 18
  • Your posted `dput` does not match your expected result. – Ian Campbell Jun 09 '21 at 12:58
  • Thank you, I have updated it. It actually contained a list of columns similar to my original data. – Sandy Jun 09 '21 at 13:00
  • @Sandy, doing it for one Q is rather easy, but for several Qs at once require thinking. :) shall I post an answer for this dput? I am sorry to say that this question is again too simplified for your actual requirement!! – AnilGoyal Jun 09 '21 at 13:21

3 Answers3

1

You might try something like this with rowwise and c_across:

library(dplyr)
df %>%
  rowwise %>%
  summarise(ID = ID,
            Max = `Q1.Comm...Overall.Plan` > max(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE),
            Min = `Q1.Comm...Overall.Plan` < min(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE),
            Range = `Q1.Comm...Overall.Plan` >= range(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE)[1] &
                    `Q1.Comm...Overall.Plan` <= range(c_across(-c(ID,`Q1.Comm...Overall.Plan`)),na.rm = TRUE)[2]) %>%
  mutate(Result = case_when(Max ~ "above",
                            Min ~ "below",
                            Range ~ "within",
                            TRUE ~ NA_character_))
# A tibble: 4 x 5
     ID Max   Min   Range Result
  <int> <lgl> <lgl> <lgl> <chr> 
1    10 FALSE FALSE TRUE  within
2    31 FALSE FALSE TRUE  within
3   225 TRUE  FALSE FALSE above 
4   243 NA    NA    NA    NA    

You can change summarise to mutate to keep the original columns and/or select to get rid of them.

See the dplyr rowwise tutorial for more information.

Ian Campbell
  • 23,484
  • 14
  • 36
  • 57
1
library(purrr)
library(data.table)

needed_cols <- setdiff(names(df), c("ID", "Q1.Comm...Overall.Plan"))

setDT(df)[, c("min", "max") := transpose(pmap(.SD, range, na.rm = TRUE)), .SDcols = needed_cols]
df[, Q1_check := fcase(
    is.na(`Q1.Comm...Overall.Plan`), NA_character_,
    `Q1.Comm...Overall.Plan` < min, "below",
    `Q1.Comm...Overall.Plan` > max, "above",
    default = "within"
  )
]
df[, c("max", "min") := NULL]
det
  • 5,013
  • 1
  • 8
  • 16
  • Thanks so much @det. It seems to work for the given df. A quick question, can it be used inside a for loop, as I have additional variables beginning with name like Q2, then another set of variables beginning with names like Q3. I can, of course run your neat script 3 times or more. But just wondering if it can be updated please? – Sandy Jun 09 '21 at 13:48
  • Your suggested solution is also working on my original dataset. I just have to repeat it for each of the questions separately. I will accept it. Thank you! – Sandy Jun 10 '21 at 00:47
  • it can be, but I don't know exactly what you want to do inside loop. It might be even possible to do it without loop. – det Jun 10 '21 at 05:34
  • I've added answer in your other question hope that helps you regarding loop (this question have nothing to do with Q2). – det Jun 10 '21 at 08:26
0

I have revised your dput to suit to your requirements as discussed in the linked question. I think this will help you. I used janitor::clean_names() which I suggest you to use before proceeding so that your column names are cleaned.

So the revised dput is

df <- structure(list(id = c(10L, 31L, 225L, 243L), q1_comm_01_scope_thesis = c(NA, 
2L, 0L, NA), q1_comm_02_scope_project = c(NA, NA, NA, 2L), q1_comm_03_learn_intern = c(4L, 
NA, NA, NA), q1_comm_04_biography = c(NA, NA, NA, 1L), q1_comm_overall_plan = c(4L, 
1L, 2L, NA), q2_comm_01_scope_thesis = c(NA, 4, 0, NA), q2_comm_02_scope_project = c(NA, 
NA, NA, 4), q2_comm_03_learn_intern = c(8, NA, NA, NA), q2_comm_04_biography = c(NA, 
NA, NA, 2), q2_comm_overall_plan = c(8, 2, 4, NA)), row.names = c(NA, 
-4L), class = "data.frame")

df
   id q1_comm_01_scope_thesis q1_comm_02_scope_project q1_comm_03_learn_intern q1_comm_04_biography q1_comm_overall_plan q2_comm_01_scope_thesis
1  10                      NA                       NA                       4                   NA                    4                      NA
2  31                       2                       NA                      NA                   NA                    1                       4
3 225                       0                       NA                      NA                   NA                    2                       0
4 243                      NA                        2                      NA                    1                   NA                      NA
  q2_comm_02_scope_project q2_comm_03_learn_intern q2_comm_04_biography q2_comm_overall_plan
1                       NA                       8                   NA                    8
2                       NA                      NA                   NA                    2
3                       NA                      NA                   NA                    4
4                        4                      NA                    2                   NA

Now proceed as suggested. You have to modify [-5] inside cur_data() to suit to your requirements (as per the relative location of overall_column I think 9 in your case)

library(tidyverse)

split.default(df[-1], gsub('(q\\d*)(.*)', '\\1', names(df[-1]), perl = T)) %>%
  map(., ~ .x %>% bind_cols('id' = df$id) %>%
        group_by(id) %>%
        mutate(across(ends_with('_overall_plan'), ~ case_when(. < min(cur_data()[-5], na.rm = T) ~ 'below',
                                                              . > max(cur_data()[-5], na.rm = T) ~ 'above',
                                                              is.na(.) ~ NA_character_,
                                                              TRUE ~ 'within'),
                      .names = '{str_remove(.col,"_comm_overall_plan")}_check'))
        ) %>%
  reduce(left_join, by = 'id')

# A tibble: 4 x 13
# Groups:   id [4]
  q1_comm_01_scop~ q1_comm_02_scop~ q1_comm_03_lear~ q1_comm_04_biog~ q1_comm_overall~    id q1_check q2_comm_01_scop~ q2_comm_02_scop~ q2_comm_03_lear~ q2_comm_04_biog~
             <int>            <int>            <int>            <int>            <int> <int> <chr>               <dbl>            <dbl>            <dbl>            <dbl>
1               NA               NA                4               NA                4    10 within                 NA               NA                8               NA
2                2               NA               NA               NA                1    31 below                   4               NA               NA               NA
3                0               NA               NA               NA                2   225 above                   0               NA               NA               NA
4               NA                2               NA                1               NA   243 NA                     NA                4               NA                2
# ... with 2 more variables: q2_comm_overall_plan <dbl>, q2_check <chr>
AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • 1
    Thank you for letting me know about the janitor() package. However, the solution is not working on my original dataset. – Sandy Jun 10 '21 at 00:46
  • That's why I asked you to post a few rows (4 may be) of your complete dataset. Tell me one more thing, are IDs unique for each row? – AnilGoyal Jun 10 '21 at 01:09
  • Yes, the IDs are unique. I have pre-processed the data to be in ```wide``` format where all information related to an ID is within their own row. – Sandy Jun 10 '21 at 01:15