Adding a column according to norm data in R

Question

I have a longitudinal dataset in the long form with the length of around 2800, with around 400 participants in total. Here's a sample of my data.

#    ID  wave score sex age edu 
#1  1001 1   28     1 69  12
#2  1001 2   27     1 70  12
#3  1001 3   28     1 71  12
#4  1001 4   26     1 72  12
#5  1002 1   30     2 78   9
#6  1002 3   30     2 80   9
#7  1003 1   30     2 65  16
#8  1003 2   30     2 66  16
#9  1003 3   29     2 67  16
#10 1003 4   28     2 68  16
#11 1004 1   22     2 85   4
#12 1005 1   20     2 60   9
#13 1005 2   18     1 61   9
#14 1006 1   22     1 74   9
#15 1006 2   23     1 75   9
#16 1006 3   25     1 76   9
#17 1006 4   19     1 77   9

I want to create a new column "cutoff" with values "Normal" or "Impaired" because my outcome data, "score" has a cutoff score indicating impairment according to norm. The norm consists of different -1SD measures(the cutoff point) according to Sex, Edu(year of education), and Age.

Below is what I'm currently doing, checking an excel file myself and putting in the corresponding cutoff score according to the three conditions. First of all, I am not sure if I am creating the right column.

data$cutoff <- ifelse(data$sex==1 & data$age<70
                  & data$edu<3
                  & data$score<19.91, "Impaired", "Normal")
data$cutoff <- ifelse(data$sex==2 & data$age<70
                  & data$edu<3
                  & data$score<18.39, "Impaired", "Normal")

Additionally, I am wondering if I can import an excel file stating the norm, and create a column according to the values in it.

The excel file has a structure as shown below.

#      Sex  Male                      Female            
#60-69 Edu(yr)  0-3 4-6 7-12   13>=   0-3   4-6 7-12    13>=
#Age   Number   22  51  119    72     130   138 106     51
#      Mean   24.45 26.6 27.06 27.83  23.31 25.86   27.26   28.09
#      SD     3.03  1.89    1.8 1.53  3.28  2.55    1.85    1.44
#     -1.5SD' 19.92 23.27   23.76   24.8    18.53   21.81   23.91   25.15
#70-79 Edu(yr)  0-3 4-6 7-12   13>=   0-3   4-6 7-12    13>=
....

I have created new columns "agecat" and "educat," allocating each ID into a group of age and education used in the norm. Now I want to make use of these columns, matching it with rows and columns of the excel file above. One of the motivations is to create a code that can be used for further research using the test scores of my data.

I think ideally you can do this using non-equi joins (https://stackoverflow.com/questions/41043047/complexe-non-equi-merge-in-r). But in the meantime it's probably easier to use `ifelse` or `case_when` from the `dplyr` package (https://stackoverflow.com/questions/54690043/conditionally-convert-strings-to-a-specific-numeric-value/54690774#54690774) . — hannes101, Feb 19 '19 at 15:12
It's hard to tell what's going on with the data you said is in an Excel file. Maybe you can read it in and provide the data with `dput`? [See here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) on making a reproducible R question that's easy for folks to help with — camille, Feb 19 '19 at 16:11

score 0 · Accepted Answer · answered Feb 19 '19 at 15:30

I think your ifelse statements should work fine, but I would definitely import the Excel file rather than hardcoding it, though you may need to structure it a bit differently. I would structure it just like a dataset, with columns for Sex, Edu, Age, Mean, SD, -1.5SD, etc., import it into R, then do a left outer join on Sex+Edu+Age:

merge(x = long_df, y = norm_df, by = c("Sex", "Edu(yr)", "Age"), all.x = TRUE)

Then you can compare the columns directly.

score 0 · Answer 2 · edited Jun 20 '20 at 09:12

If I understand correctly, the OP wants to mark a certain type of outliers in his dataset. So, there are two tasks here:

Compute the statistics mean(score), sd(score), and cutoff value mean(score) - 1.5 * sd(score) for each group of sex, age category agecat, and edu category edcat.
Find all rows where score is lower than the cutoff value for the particular group.

As already mentioned by hannes101, the second step can be implemented by a non-equi join.

library(data.table)

# categorize age and edu (left closed intervals)
mydata[, c("agecat", "educat") := .(cut(age, c(seq(0, 90, 10), Inf), right = FALSE),
                                    cut(edu, c(0, 4, 7, 13, Inf), right = FALSE))][]
# compute statistics
cutoffs <- mydata[, .(.N, Mean = mean(score), SD = sd(score), 
                      m1.5SD = mean(score) - 1.5 * sd(score)),
                  by = .(sex, agecat, educat)]

# non-equi update join
mydata[, cutoff := "Normal"]
mydata[cutoffs, on = .(sex, agecat, educat, score < m1.5SD), cutoff := "Impaired"][]

mydata

      ID wave score sex age edu  agecat   educat   cutoff
 1: 1001    1    28   1  69  12 [60,70)   [7,13)   Normal
 2: 1001    2    27   1  70  12 [70,80)   [7,13)   Normal
 3: 1001    3    28   1  71  12 [70,80)   [7,13)   Normal
 4: 1001    4    26   1  72  12 [70,80)   [7,13)   Normal
 5: 1002    1    30   2  78   9 [70,80)   [7,13)   Normal
 6: 1002    3    30   2  80   9 [80,90)   [7,13)   Normal
 7: 1003    1    33   2  65  16 [60,70) [13,Inf)   Normal
 8: 1003    2    32   2  66  16 [60,70) [13,Inf)   Normal
 9: 1003    3    31   2  67  16 [60,70) [13,Inf)   Normal
10: 1003    4    24   2  68  16 [60,70) [13,Inf) Impaired
11: 1004    1    22   2  85   4 [80,90)    [4,7)   Normal
12: 1005    1    20   2  60   9 [60,70)   [7,13)   Normal
13: 1005    2    18   1  61   9 [60,70)   [7,13)   Normal
14: 1006    1    22   1  74   9 [70,80)   [7,13)   Normal
15: 1006    2    23   1  75   9 [70,80)   [7,13)   Normal
16: 1006    3    25   1  76   9 [70,80)   [7,13)   Normal
17: 1006    4    19   1  77   9 [70,80)   [7,13)   Normal
18: 1007    1    33   2  65  16 [60,70) [13,Inf)   Normal
19: 1007    2    32   2  66  16 [60,70) [13,Inf)   Normal
20: 1007    3    31   2  67  16 [60,70) [13,Inf)   Normal
21: 1007    4    31   2  68  16 [60,70) [13,Inf)   Normal
      ID wave score sex age edu  agecat   educat   cutoff

In this made-up example there is only one row which meets the "Impaired" conditions.

Likewise, the statistics is rather sparsely populated:

cutoffs

   sex  agecat   educat N     Mean       SD   m1.5SD
1:   1 [60,70)   [7,13) 2 23.00000 7.071068 12.39340
2:   1 [70,80)   [7,13) 7 24.28571 3.147183 19.56494
3:   2 [70,80)   [7,13) 1 30.00000       NA       NA
4:   2 [80,90)   [7,13) 1 30.00000       NA       NA
5:   2 [60,70) [13,Inf) 8 30.87500 2.900123 26.52482
6:   2 [80,90)    [4,7) 1 22.00000       NA       NA
7:   2 [60,70)   [7,13) 1 20.00000       NA       NA

Data

OP's sample dataset has been modified in one group for demonstration.

library(data.table)

mydata <- fread("
#    ID  wave score sex age edu 
#1  1001 1   28     1 69  12
#2  1001 2   27     1 70  12
#3  1001 3   28     1 71  12
#4  1001 4   26     1 72  12
#5  1002 1   30     2 78   9
#6  1002 3   30     2 80   9
#7  1003 1   33     2 65  16
#8  1003 2   32     2 66  16
#9  1003 3   31     2 67  16
#10 1003 4   24     2 68  16
#11 1004 1   22     2 85   4
#12 1005 1   20     2 60   9
#13 1005 2   18     1 61   9
#14 1006 1   22     1 74   9
#15 1006 2   23     1 75   9
#16 1006 3   25     1 76   9
#17 1006 4   19     1 77   9
#18 1007 1   33     2 65  16
#19 1007 2   32     2 66  16
#20 1007 3   31     2 67  16
#21 1007 4   31     2 68  16
", drop = 1L)

Adding a column according to norm data in R

2 Answers2

Data