# Using the spread of data to determine a (descriptive) position within the dataset

My code is the following:

jobs_df <- jobs_df %>%
    mutate(description = if_else(quan_value < 'q1' , "Lowest", 
                      if_else(quan_value < 'q2', "Low", 
                              if_else(quan_value < 'q3' , "Medium", 
                                      if_else(quan_value < 'q4' , "High", 
                                              if_else(quan_value < 'q5', "Highest", NA_character_))))))

Here, "description" should be one of Lowest, Low, Medium, High, or Highest for each row of the dataframe, and q1, q2, q3, q4, q5 are the quintile values of the "quan_value" column.

The dataframe (jobs_df) is as follows:

  jobs         quan_value    q1    q2    q3    q4    q5
  <chr>             <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banker              1.3     2     4     6     8     1
2 Accountant          2.4     2     4     6     8     1
3 Waiter              4.2     2     4     6     8     1
4 Barista             6.3     2     4     6     8     1
5 Train driver        9.1     2     4     6     8     1

"description" is the new column I want based on the if_else statement; however, it mostly just returns "Medium" as the result.

user438383
sajtyrr
  • There is a typo in the code. You should remove the quotes around the column names i.e. `q1`, `q2` is enough instead of `'q1'`, `'q2'` – akrun Dec 21 '21 at 17:29
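
To illustrate the comment above, here is a minimal base R sketch of what changes when the column name is quoted. The vectors `quan_value` and `q1` are stand-ins for the columns in the question, not the asker's real data. With quotes, R compares against the literal text "q1" after coercing the numbers to character, so the result depends on string ordering rather than on the quintile values.

```r
# Stand-ins for the question's columns (illustrative values only)
quan_value <- c(1.3, 2.4, 4.2, 6.3, 9.1)
q1 <- 2

# Quoted: the numbers are coerced to character and compared with the text "q1".
# In common locales digits sort before letters, so this is typically TRUE everywhere.
quoted <- quan_value < "q1"

# Unquoted: an ordinary numeric comparison against the value of q1
bare <- quan_value < q1

print(quoted)
print(bare)
```

Only the unquoted form compares the value against the quintile.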

1 Answer

Any time I see more than 2 nested if_else (or ifelse or fifelse), I lean towards case_when:

jobs_df %>%
  mutate(description = case_when(
      quan_value < q1 ~ "Lowest", 
      quan_value < q2 ~ "Low", 
      quan_value < q3 ~ "Medium", 
      quan_value < q4 ~ "High", 
      quan_value < q5 ~ "Highest", 
      TRUE ~ NA_character_)
  )
#           jobs quan_value q1 q2 q3 q4 q5 description
# 1       Banker        1.3  2  4  6  8  1      Lowest
# 2   Accountant        2.4  2  4  6  8  1         Low
# 3       Waiter        4.2  2  4  6  8  1      Medium
# 4      Barista        6.3  2  4  6  8  1        High
# 5 Train driver        9.1  2  4  6  8  1        <NA>
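
The last row comes out as NA because q5 is 1 in the sample data, so `quan_value < q5` is false for 9.1 and nothing else matches. A small sketch on a plain vector, with the thresholds written as literals copied from the sample data, shows that `case_when` uses the first condition that is TRUE for each element and falls through to the `TRUE ~ ...` default otherwise:

```r
library(dplyr)

x <- c(1.3, 2.4, 4.2, 6.3, 9.1)

# Conditions are checked top to bottom; the first TRUE one wins.
labels <- case_when(
  x < 2 ~ "Lowest",
  x < 4 ~ "Low",
  x < 6 ~ "Medium",
  x < 8 ~ "High",
  x < 1 ~ "Highest",     # q5 is 1 in the sample data, so this never matches here
  TRUE  ~ NA_character_  # default when no condition above is TRUE
)

print(labels)
```

If q5 holds the real maximum in your data, the fifth condition would catch values between q4 and q5; this is an observation about the sample data, not about your real columns.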

Update: since you say your names are a bit non-standard, I'll demonstrate using jobs_df2 (which has what I think are closer to your real names). Notable is that you need to wrap non-compliant object/column names in backticks:

jobs_df2 %>%
  mutate(description = case_when(
      quan_value < `20%` ~ "Lowest", 
      quan_value < `40%` ~ "Low", 
      quan_value < `60%` ~ "Medium", 
      quan_value < `80%` ~ "High", 
      quan_value < `100%` ~ "Highest", 
      TRUE ~ NA_character_)
  )
#           jobs quan_value 20% 40% 60% 80% 100% description
# 1       Banker        1.3   2   4   6   8    1      Lowest
# 2   Accountant        2.4   2   4   6   8    1         Low
# 3       Waiter        4.2   2   4   6   8    1      Medium
# 4      Barista        6.3   2   4   6   8    1        High
# 5 Train driver        9.1   2   4   6   8    1        <NA>
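
As a small base R illustration of the backtick point (the data frame `df` here is made up for the example), a column called `20%` is reached as a name with backticks, whereas the same text in quotes is just a string:

```r
# check.names = FALSE keeps the non-syntactic column name "20%" as-is
df <- data.frame(quan_value = c(1.3, 2.4), check.names = FALSE)
df[["20%"]] <- c(2, 2)

# Backticks: refers to the column named 20%
by_name <- with(df, quan_value < `20%`)

# Quotes: the literal string "20%", not the column
as_string <- "20%"

print(by_name)
print(class(as_string))
```

The same backtick syntax works inside `mutate()` and `case_when()`, as in the code above.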

Data

jobs_df <- structure(list(jobs = c("Banker", "Accountant", "Waiter", "Barista", "Train driver"), quan_value = c(1.3, 2.4, 4.2, 6.3, 9.1), q1 = c(2L, 2L, 2L, 2L, 2L), q2 = c(4L, 4L, 4L, 4L, 4L), q3 = c(6L, 6L, 6L, 6L, 6L), q4 = c(8L, 8L, 8L, 8L, 8L), q5 = c(1L, 1L, 1L, 1L, 1L)), row.names = c("1", "2", "3", "4", "5"), class = "data.frame")
jobs_df2 <- structure(list(jobs = c("Banker", "Accountant", "Waiter", "Barista", "Train driver"), quan_value = c(1.3, 2.4, 4.2, 6.3, 9.1), "20%" = c(2L, 2L, 2L, 2L, 2L), "40%" = c(4L, 4L, 4L, 4L, 4L), "60%" = c(6L, 6L, 6L, 6L, 6L), "80%" = c(8L, 8L, 8L, 8L, 8L), "100%" = c(1L, 1L, 1L, 1L, 1L)), row.names = c("1", "2", "3", "4", "5"), class = "data.frame")
r2evans
  • Thank you, I will try this! – sajtyrr Dec 21 '21 at 17:39
  • This is very helpful because the code is much simpler and tidier; however, it did not fix my issue. I believe the issue is that the variables in my dataframe are not q1, q2, q3, q4, q5, but are actually labelled 20%, 40%, 60%, 80%, 100%. dplyr does not seem to respond to these variable names; when they are changed to the former q... series, the code works! Do you know why this is the case? – sajtyrr Dec 21 '21 at 17:57
  • If your column header is literally that, then enclose them in backticks: `case_when(quan_value < \`q1\` ~ "Lowest", ...)`. The backtick-thing is R's way for you to say "this is not normally an R-preferred variable name, but use it anyway". It's also needed when there are spaces. Ultimately it's to allow arbitrary variable/column names when otherwise the parser will think something else should be done with a "token". (Note that it's a *backtick* `\``, not a single-quote `'`. Very different.) – r2evans Dec 21 '21 at 18:00
  • As a side comment, it's issues like that that make providing sample data in an unambiguous format (i.e., with `dput(.)`) very useful. It would have quickly identified the root cause of your problem causing you to try single-quotes in your first attempt. – r2evans Dec 21 '21 at 18:06
  • Yes that makes a lot of sense. I was using quotation marks and it was not working. Back ticks will solve this I believe. Thank you so much! – sajtyrr Dec 21 '21 at 18:07
  • I looked into using dput(.) and I did not fully understand what this means – sajtyrr Dec 21 '21 at 18:10
  • (1) The method (for using `dput`) is discussed in https://stackoverflow.com/q/5963269/3358272, one section (*"Copying original data"*) walks you through its use and output. The intent is to paste its output into a code-block in your question. (2) You now have 15 rep points, you now have the [privilege](https://stackoverflow.com/help/privileges) to upvote and are closer to "comment everywhere". Use your new-found powers wisely ;-) – r2evans Dec 21 '21 at 18:12
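
For readers unsure what `dput` produces, here is a minimal sketch with a made-up two-row data frame (`df` is illustrative, not the asker's data). `dput` prints R code that rebuilds the object, including the exact column names, which is why it removes ambiguity when pasted into a question:

```r
df <- data.frame(jobs = c("Banker", "Waiter"), quan_value = c(1.3, 4.2))

# dput() writes a structure(...) call that recreates the object
dput(df)

# Round trip: capture that text and evaluate it to rebuild the data frame
dput_text <- capture.output(dput(df))
rebuilt <- eval(parse(text = dput_text))

print(all.equal(df, rebuilt))
```

The `structure(list(...))` lines in the Data section of the answer are output of this kind.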