1

I have a dataframe df which looks like this:

Input:

df <- read.table(text = 

"ID  Q1_PM Q1_TP Q1_overall  Q2_PM  Q2_LS  Q2_overall
 1   1     2     3           1       2     2       
 2   0     NA    NA          2       1     1 
 3   2     1     1           3       4     0  
 4   1     0     2           4       0     2 
 5   NA    1     NA          0       NA    0  
 6   2     0     1           1       NA    NA"   

, header = TRUE)

Desired Output:

To explain a little further, my desired output is as below:

 ID  Q1_PM Q1_TP Q1_overall  Q2_PM  Q2_LS  Q2_overall Q1_check  Q2_check
 1   1     2     3           1       2     2          "above"   "within"
 2   0     NA    NA          2       1     1           NA       "within"
 3   2     1     1           3       4     0          "within"  "below"
 4   1     0     2           4       0     2          "above"   "within"
 5   NA    1     NA          0       NA    0           NA       "within"
 6   2     0     1           1       NA    NA         "within"   NA

Explanation:

Example 1:

Based on the value in columns Q1_PM and Q1_TP, I want to see whether the value in column Q1_overall is within their range or not? If, not in range, is the value above or below the range? To track this, I want to add an additional column Q1_check.

Example 2:

Similarly, based on the values of Q2_PM and Q2_LS, I want to check if the value of Q2_overall is within their range or not? If not in range, is it above or below the range? Again, to track this, I want to add an additional column Q2_check

Requirements:

1- For this, I want to add additional columns Q1_check and Q2_check where the first column is for the comparisons that involve Q1 items and the second column is for the comparisons that involve Q2 items.

2- The columns could contain the following values: above, below and within.

3- The case when the columns named overall have NAs, then the extra columns could also have NAs.

Related posts:

I have looked for related posts such as: Add column with values depending on another column to a dataframe and Create categories by comparing a numeric column with a fixed value but I am running into errors as discussed below.

Partial Solution:

The only solution, I can think of is, along these lines:

df$Q1_check <- ifelse(data$Q1_overall < data$Q1_PM, 'below',
                        ifelse(data$Q1_overall > data$Q1_TP, 'above', 
                               ifelse(is.na(data$Q1_overall), NA, 'within')))

But it results in following error: Error in data$Q1_overall : object of type 'closure' is not subsettable. I do not understand what the possible issue could be.

OR

df %>%
  mutate(Regulation = case_when(Q1_overall < Q1_PM ~ 'below',
                                Q1_overall > Q1_TP ~ 'above', 
                                Q1_PM < Q1_overall < Q1_TP, 'within'))

This also results in error Error: unexpected '<' in: "Q1_overall > Q1_TP ~ 'above', Q1_PM < Q1_overall <"

Edit 1:

How can the solution be extended if (let's say) the columns are these:

"Q1 Comm - 01 Scope Thesis"
"Q1 Comm - 02 Scope Project"
"Q1 Comm - 03 Learn Intern"
"Q1 Comm - 04 Biography"
"Q1 Comm - 05 Exhibit"
"Q1 Comm - 06 Social Act"
"Q1 Comm - 07 Post Project"
"Q1 Comm - 08 Learn Plant"
"Q1 Comm - 09 Study Narrate"
"Q1 Comm - 10 Learn Participate"
"Q1 Comm - 11 Write 1"
"Q1 Comm - 12 Read 2"
"Q1 Comm - Overall Study Plan"

How can we identify when the column Q1 Comm - Overall Study Plan is:

1 - Below the min() of all the other columns, or

2 - Above the max() of all the other columns, or

3 - Within the range of all the other columns

Edit 2:

For the updated fields, I am also including the dput(df)

dput(df)

structure(list(ï..ID = c(10L, 31L, 225L, 243L), Q1.Comm...01.Scope.Thesis = c(NA, 
2L, 0L, NA), Q1.Comm...02.Scope.Project = c(NA, NA, NA, 2L), 
    Q1.Comm...03.Learn.Intern = c(4L, NA, NA, NA), Q1.Comm...04.Biography = c(NA, 
    NA, NA, 1L), Q1.Comm...05.Exhibit = c(4L, 2L, NA, NA), Q1.Comm...06.Social.Act = c(NA, 
    NA, NA, 3L), Q1.Comm...07.Post.Project = c(NA, NA, 3L, NA
    ), Q1.Comm...08.Learn.Plant = c(NA, NA, NA, 4L), Q1.Comm...09.Study.Narrate = c(NA, 
    NA, 0L, NA), Q1.Comm...10.Learn.Participate = c(4L, NA, NA, 
    NA), Q1.Comm...11.Write.1 = c(NA, 2L, NA, NA), Q1.Comm...12.Read.2 = c(NA, 
    NA, 1L, NA), Q1.Comm...Overall.Study.Plan = c(4L, 1L, 2L, 
    NA), X = c(NA, NA, NA, NA), X.1 = c(NA, NA, NA, NA), X.2 = c(NA, 
    NA, NA, NA)), class = "data.frame", row.names = c(NA, -4L
))

Any advice on how to achieve this would be greatly appreciated. Thank you!

Sandy
  • 1,100
  • 10
  • 18
  • The error you get is because `data$Q1-overall` is parsed as `data$Q1 - overall`; you want `data$\`Q1-overall\`` – Bas Jun 09 '21 at 06:53
  • Why are the column names inconsistent? I.e., why `Q1-TP` and not `Q1-LS`, for example? – Bas Jun 09 '21 at 06:54
  • There are multiple different measures in my case, to mock that up, I made ```Q1-TS``` and ```Q1-LS```. If having both as ```TP``` or both as ```LS``` helps simplify the solution, please do suggest. I could do other manipulation. – Sandy Jun 09 '21 at 06:57
  • @Sandy, from which columns (edited scenario) you want to compare `Q1 Comm - Overall Study Plan` say? I'm not sure what are you after? – AnilGoyal Jun 09 '21 at 11:22
  • 1
    Thank you @AnilGoyal for your quick reply. I have updated my question to address your comment. Is it any clear now? – Sandy Jun 09 '21 at 11:24
  • 1
    I have also included the dput(df) for the extended version. – Sandy Jun 09 '21 at 12:25

5 Answers5

1

Seems a very long winded approach -

library(dplyr)

comparison <- function(x, y, z) {
  case_when(is.na(z) ~ NA_character_,
            z >= x & z <= y | 
              z >= y & z <= x |
              is.na(x) & y == z |
              is.na(y) & x == z ~ 'within', 
            z > x & z > y ~ 'above', 
            TRUE ~ 'below')
}

df %>%
  mutate(Q1_check = comparison(Q1.PM, Q1.TP, Q1.overall), 
          Q2_check = comparison(Q2.PM, Q2.LS, Q2.overall))
  
         
#  ID Q1.PM Q1.TP Q1.overall Q2.PM Q2.LS Q2.overall Q1_check Q2_check
#1  1     1     2          3     1     2          2    above   within
#2  2     0    NA         NA     2     1          1     <NA>   within
#3  3     2     1          1     3     4          0   within    below
#4  4     1     0          2     4     0          2    above   within
#5  5    NA     1         NA     0    NA          0     <NA>   within
#6  6     2     0          1     1    NA         NA   within     <NA>
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Thank you @Ronak Shah, how can I apply it to several columns (let's say 8 different columns), it would make this solution quite complicated... – Sandy Jun 09 '21 at 06:59
  • 1
    What do you mean? So instead of `PM`, `TP` and `Overall` there are other columns in your data? What are the rules to check in them. – Ronak Shah Jun 09 '21 at 07:02
  • I just want to perform column comparisons to see if the ```overall``` column is within the range of the other columns. For example, having some maximum and minimum and then seeing if the overall column falls within that maximum and minimum. Does it answer your question @Ronak Shah? – Sandy Jun 09 '21 at 07:56
  • I have edited my question @Ronak Shah, and I have provided my actual column names, can you advise based on that, please? – Sandy Jun 09 '21 at 10:58
1
df <- read.table(text = 
                   
                   "ID  Q1-PM Q1-TP Q1-overall  Q2-PM  Q2-LS  Q2-overall
 1   1     2     3           1       2     2       
 2   0     NA    NA          2       1     1 
 3   2     1     1           3       4     0  
 4   1     0     2           4       0     2 
 5   NA    1     NA          0       NA    0  
 6   2     0     1           1       NA    NA"   
                 
                 , header = TRUE)

library(tidyverse)


f <- function(x, y, z){
  case_when(
    z < pmin(x, y, na.rm = TRUE) ~ "below",
    z > pmax(x, y, na.rm = TRUE) ~ "abowe",
    between(z, pmin(x, y, na.rm = TRUE), pmax(x, y, na.rm = TRUE)) ~ "within"
  )
}

df %>%
  rowwise() %>% 
  mutate(Q1_check = f(Q1.PM, Q1.TP, Q1.overall),
         Q2_check = f(Q2.PM, Q2.LS, Q2.overall))
#> # A tibble: 6 x 9
#> # Rowwise: 
#>      ID Q1.PM Q1.TP Q1.overall Q2.PM Q2.LS Q2.overall Q1_check Q2_check
#>   <int> <int> <int>      <int> <int> <int>      <int> <chr>    <chr>   
#> 1     1     1     2          3     1     2          2 abowe    within  
#> 2     2     0    NA         NA     2     1          1 <NA>     within  
#> 3     3     2     1          1     3     4          0 within   below   
#> 4     4     1     0          2     4     0          2 abowe    within  
#> 5     5    NA     1         NA     0    NA          0 <NA>     within  
#> 6     6     2     0          1     1    NA         NA within   <NA>

Created on 2021-06-09 by the reprex package (v2.0.0)

Yuriy Saraykin
  • 8,390
  • 1
  • 7
  • 14
  • Thank you @Yuriy Saraykin, any chance we could update the function to calculate ```pmin()``` or ```pmax()``` of more than two columns? – Sandy Jun 09 '21 at 10:26
1

Largely based on Ronak's great solution:

df <- structure(list(ID = c(10L, 31L, 225L, 243L), 
                      `Q1 Comm - 01 Scope Thesis` = c(NA, 2L, 0L, NA), 
                      `Q1 Comm - 02 Scope Project` = c(NA, NA, NA, 2L), 
                      `Q1 Comm - 03 Learn Intern` = c(4L, NA, NA, NA), 
                      `Q1 Comm - 04 Biography` = c(NA, NA, NA, 1L), 
                      `Q1 Comm - 05 Exhibit` = c(4L, 2L, NA, NA), 
                      `Q1 Comm - 06 Social Act` = c(NA, NA, NA, 3L), 
                      `Q1 Comm - 07 Post Project` = c(NA, NA, 3L, NA), 
                      `Q1 Comm - 08 Learn Plant` = c(NA, NA, NA, 4L), 
                      `Q1 Comm - 09 Study Narrate` = c(NA, NA, 0L, NA), 
                      `Q1 Comm - 10 Learn Participate` = c(4L, NA, NA,NA), 
                      `Q1 Comm - 11 Write 1` = c(NA, 2L, NA, NA), 
                      `Q1 Comm - 12 Read 2` = c(NA, NA, 1L, NA), 
                      `Q1 Comm - Overall Study Plan` = c(4L, 1L, 2L, NA), 
                      X = c(NA, NA, NA, NA), 
                      `X 1` = c(NA, NA, NA, NA), 
                      `X 2` = c(NA, NA, NA, NA)), 
                class = "data.frame", row.names = c(NA, -4L))

library(dplyr)

comparison <- function(df, prefix) {
  df <- df[grep(prefix, colnames(df))]
  min <- apply(df[-grep("Overall", colnames(df))], 1, min, na.rm = T)
  max <- apply(df[-grep("Overall", colnames(df))], 1, max, na.rm = T)
  z <- df[grep("Overall", colnames(df))]
  case_when(is.na(z) ~ NA_character_,
            z >= min & z <= max ~ 'within', 
            z > max ~ 'above', 
            TRUE ~ 'below')
}

prefixes <- sub(" \\- Overall.*", "", colnames(df[grep("Overall", colnames(df))]))

for (i in prefixes) {
  df <- df %>%
    mutate("{i} - Check" := comparison(df, i))
}


> print(df)
   ID Q1 Comm - 01 Scope Thesis Q1 Comm - 02 Scope Project Q1 Comm - 03 Learn Intern Q1 Comm - 04 Biography
1  10                        NA                         NA                         4                     NA
2  31                         2                         NA                        NA                     NA
3 225                         0                         NA                        NA                     NA
4 243                        NA                          2                        NA                      1
  Q1 Comm - 05 Exhibit Q1 Comm - 06 Social Act Q1 Comm - 07 Post Project Q1 Comm - 08 Learn Plant
1                    4                      NA                        NA                       NA
2                    2                      NA                        NA                       NA
3                   NA                      NA                         3                       NA
4                   NA                       3                        NA                        4
  Q1 Comm - 09 Study Narrate Q1 Comm - 10 Learn Participate Q1 Comm - 11 Write 1 Q1 Comm - 12 Read 2
1                         NA                              4                   NA                  NA
2                         NA                             NA                    2                  NA
3                          0                             NA                   NA                   1
4                         NA                             NA                   NA                  NA
  Q1 Comm - Overall Study Plan  X X 1 X 2 Q1 Comm - Check
1                            4 NA  NA  NA          within
2                            1 NA  NA  NA           below
3                            2 NA  NA  NA          within
4                           NA NA  NA  NA            <NA>
koolmees
  • 2,725
  • 9
  • 23
  • For comparison or finding the range of more than 2 columns, it would be hard to apply this solution? – Sandy Jun 09 '21 at 10:16
  • I tried to give a solution based on what you wrote under Ronak's solution. What it does is it grabs the prefix of all columns that have ".overall" in their name (so Q1 and Q2 in your example) and creates a check column, no matter if you apply it to 2 columns or 200. The only potential pitfalls are that there needs to be 3 columns per prefix and they need to be in the same order as you put in your example (min, max, overall) – koolmees Jun 09 '21 at 10:25
  • I have edited my question and have provided my actual column names, can you advise based on that, please? – Sandy Jun 09 '21 at 10:57
  • Edited my solution to meet your request. It now checks the min and max of every row within the columns that match the prefix (excluding Overall) and compares that with the Overall of said prefix – koolmees Jun 09 '21 at 12:17
  • Thank you for the update, I have included a reference ```dput(df)``` a sample where I am trying to run the code. – Sandy Jun 09 '21 at 12:27
  • I have tried this script, it is returning ```above``` for all rows. Some where the logic is not working... – Sandy Jun 09 '21 at 12:35
  • I don't experience the same issue when using your dput and the column names you had provided in your first edit. Maybe you have a package loaded that overwrites one of the base functions that are used? – koolmees Jun 09 '21 at 13:25
1

If your columns are named similarly, you may do this for any number of Qs simultaneously.

  • changed - in column names to acceptable _
  • changed Q2_LS to Q2_TP for sake of similarity

What is does -

  • It picks up every column that ends with _overall (2 here but can be any number)
  • check this columns values as -
    • If less than column having name _PM / _TP in lieu of _overall allocates value below
    • If greater than column having name _PM/_TP in lieu of _overall allocates value above
      • To access these column values I used get alongwith cur_column and stringr string replacement function
    • if current value is NA allocated a NA_character
    • otherwise allocates value within
  • Now, for final mutated columns (all at once) it renames these by removing _overall from these columns and pasting _check instead (I used .names argument of across here)
    • For this I used stringr::str_remove inside glue argument (.names follow glue style of formula)
df <- read.table(text = 
                   
                   "ID  Q1_PM Q1_TP Q1_overall  Q2_PM  Q2_TP  Q2_overall
 1   1     2     3           1       2     2       
 2   0     NA    NA          2       1     1 
 3   2     1     1           3       4     0  
 4   1     0     2           4       0     2 
 5   NA    1     NA          0       NA    0  
 6   2     0     1           1       NA    NA"   
 
 , header = TRUE)

df
#>   ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall
#> 1  1     1     2          3     1     2          2
#> 2  2     0    NA         NA     2     1          1
#> 3  3     2     1          1     3     4          0
#> 4  4     1     0          2     4     0          2
#> 5  5    NA     1         NA     0    NA          0
#> 6  6     2     0          1     1    NA         NA
library(tidyverse)
df %>% mutate(across(ends_with('overall'), ~ case_when(. < pmin(get(str_replace(cur_column(), '_overall', '_PM')),
                                                                 get(str_replace(cur_column(), '_overall', '_TP'))) ~ 'below',
                                                       . > pmax(get(str_replace(cur_column(), '_overall', '_PM')),
                                                                 get(str_replace(cur_column(), '_overall', '_TP'))) ~ 'above',
                                                       is.na(.) ~ NA_character_,
                                                       TRUE ~ 'within'),
                     .names = '{str_remove(.col,"_overall")}_check'))
#>   ID Q1_PM Q1_TP Q1_overall Q2_PM Q2_TP Q2_overall Q1_check Q2_check
#> 1  1     1     2          3     1     2          2    above   within
#> 2  2     0    NA         NA     2     1          1     <NA>   within
#> 3  3     2     1          1     3     4          0   within    below
#> 4  4     1     0          2     4     0          2    above   within
#> 5  5    NA     1         NA     0    NA          0     <NA>   within
#> 6  6     2     0          1     1    NA         NA   within     <NA>

Created on 2021-06-09 by the reprex package (v2.0.0)

AnilGoyal
  • 25,297
  • 4
  • 27
  • 45
  • 1
    Thank you @AnilGoyal, can you please explain your solution, for me to better understand? – Sandy Jun 09 '21 at 10:30
  • 1
    Yes why not @Sandy, please check the edit. The reason for changing `-` to `_` is simply R requirements/naming conventions for not using `-` inside names – AnilGoyal Jun 09 '21 at 10:38
  • 1
    Thank you @AnilGoyal. I have slightly updated my question and have included the actual column names that need to be compared with the ```overall``` column. I have copied them in their actual order (mentioning just in case if that has any implications). Could you please advise based on that? – Sandy Jun 09 '21 at 10:53
  • @Sandy, you have included column names but probably you forgot to include logic/criteria for output!! – AnilGoyal Jun 09 '21 at 11:20
  • @Sandy, got it. Can you please include a few rows of sample data to try. Moreover, how many Qs scenarios do you have? – AnilGoyal Jun 09 '21 at 11:35
  • 1
    I have a total of nine (9) question scenarios i.e., ```Q1:Q9``` and the same items are repeated in the same sequence for each of the question scenarios. The data is in wide format and there are no duplicate rows. – Sandy Jun 09 '21 at 11:43
1
comparison <- function(dt, group_cols, new_col, compare_col){
  
  dt[, 
     c("min", "max") := transpose(pmap(.SD, range, na.rm = TRUE)), .SDcols = group_cols
     ][,(new_col) := fcase(
       is.na(get(compare_col)), NA_character_,
       get(compare_col) < min, "below",
       get(compare_col) > max, "above",
       default = "within"
     )
     ][]
}

group_cols <- names(df) %>%
 str_subset("^Q[0-9]+") %>%
 str_subset("overall", negate = TRUE) %>%
 split(str_extract(., "^Q[0-9]+"))

new_cols <- names(group_cols) %>% str_c("_check")
compare_cols <- names(group_cols) %>% str_c("_overall")

setDT(df)

pwalk(list(group_cols, new_cols, compare_cols), ~comparison(df, ...))
df[, c("min", "max") := NULL]
det
  • 5,013
  • 1
  • 8
  • 16