Convert a data frame of numerics to string equivalents based on a reference range

Question

I have a data.frame of numeric scores related to groups (y) measured across different factors (x) with resulting scores. Similar to the table below.

BU      AUDIT CORC   GOV    PPS   TMSC   TRAIN
Unit1   2.00  0.00   2.00   4.00  1.50   2.50
Unit2   3.00  1.40   3.20   1.00  1.50   3.00
Unit3   2.50  2.40   2.80   3.00  2.75   2.50
Unit4   3.00  3.20   1.60   4.00  1.00   3.00
Unit5   2.00  2.80   2.00   2.00  3.00   2.50

Table is created like this

df %>%
  group_by(BU, CC) %>% #BU = 'unit', CC = 'Control_Category
  summarise(avg = mean(Score, na.rm = TRUE)) %>%
  dcast(BU ~ CC, value.var = "avg") %>% print()

These numeric scores have a reference to a string value, like in the "table" below.

Control_Score >  3.499 ~ "Ineffective",
Control_Score >  2.499  & Control_Score <= 3.499 ~ "Marginally Effective",
Control_Score >= 1.500  & Control_Score <= 2.499 ~ "Generally Effective",
Control_Score >  0.000  & Control_Score <= 1.499 ~ "Highly Effective"

I tried a few apply functions to try and apply a comparison to the values. Also tried mutate with a case_when to no available.

In the end it would be ideal if the table looked like:

BU, AUDIT, CORC, GOV, PPS, TMSC, TRAIN
Unit1, Generally Effective,  Highly Effective, etc, etc
Unit2, Marginally Effective, Highly Effective, etc, etc
Unit3, ...,...,...
Unit4, ...,...,...
Unit5, ...,...,...

shouldn't 0 be highly effective? second column of first row! — M--, Aug 23 '17 at 20:51
The tables I'm showing here are just examples and do not necessarily correspond 1:1. — Zach, Aug 23 '17 at 20:52
OK. But it would be better if they do. [How to make a great reproducible example in R?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) This is a good thread to read. — M--, Aug 23 '17 at 20:53

score 1 · Accepted Answer · edited Jul 24 '20 at 23:10

You can use the case_when from dplyr to do this.

df1 <- read.table(header = TRUE,
  text = 'BU AUDIT CORC GOV PPS TMSC TRAIN
  Unit1   2.0  0.0 2.0   4 1.50   2.5
  Unit2   3.0  1.4 3.2   1 1.50   3.0
  Unit3   2.5  2.4 2.8   3 2.75   2.5
  Unit4   3.0  3.2 1.6   4 1.00   3.0
  Unit5   2.0  2.8 2.0   2 3.00   2.5
  ')

I put the case_when in a function.

score_label <- function(score){
  lbl <- case_when(
    score < 1.5 ~ "Highly Effective",
    score >= 1.5 & score < 2.5 ~ "Generally Effective",
    score >= 2.5 & score < 3.5 ~ "Marginally Effective",
    score >= 3.5 ~ "Ineffective"
  )
  return(lbl)
}

Then use apply to apply the function to the data frame (edited per AOSmith's comment, using mutate_at from dplyr instead of the "apply" function. Much easier to read and follow.)

df_out <- df1 %>% 
    mutate_at(c("AUDIT", "CORC", "GOV", "PPS", "TMSC", "TRAIN"), score_label)

df_Out[,1:4]


   BU                AUDIT                 CORC                  GOV
Unit1  Generally Effective     Highly Effective  Generally Effective
Unit2 Marginally Effective     Highly Effective Marginally Effective
Unit3 Marginally Effective  Generally Effective Marginally Effective
Unit4 Marginally Effective Marginally Effective  Generally Effective
Unit5  Generally Effective Marginally Effective  Generally Effective

Yes, I just tried that and works great but even more readable. — pipertom, Aug 23 '17 at 22:14
love mutate_at -- i totally implemented it in my code! thank you so much! — Zach, Aug 25 '17 at 05:32

Nathan Werth · Answer 2 · 2017-08-23T21:07:27.157

cut is a great function for dividing numbers into intervals and giving them explanatory names.

Control_Score <- c(-1, 0, 1.4, 1.5, 2.499, 2.5, 3.499, 3.5, 4)

cut(
  Control_Score,
  breaks = c(0, 1.5, 2.5, 3.5, Inf),
  labels = c(
    "Highly Effective",
    "Generally Effective",
    "Marginally Effective",
    "Ineffective"
  ),
  include.lowest = TRUE
)
# [1] <NA>                 Highly Effective     Highly Effective    
# [4] Highly Effective     Generally Effective  Generally Effective 
# [7] Marginally Effective Marginally Effective Ineffective         
# 4 Levels: Highly Effective Generally Effective ... Ineffective

As you can see with the -1, any values outside the specified intervals are assigned NA. So invalid data are less likely to slip by unnoticed.

Replacing values in df:

df <- read.table(
  header = TRUE,
  text = 'BU AUDIT CORC GOV PPS TMSC TRAIN
Unit1   2.0  0.0 2.0   4 1.50   2.5
Unit2   3.0  1.4 3.2   1 1.50   3.0
Unit3   2.5  2.4 2.8   3 2.75   2.5
Unit4   3.0  3.2 1.6   4 1.00   3.0
Unit5   2.0  2.8 2.0   2 3.00   2.5
  ')

df[-1] <- lapply(
  df[-1],
  cut,
  breaks = c(0, 1.5, 2.5, 3.5, Inf),
  labels = c(
    "Highly Effective",
    "Generally Effective",
    "Marginally Effective",
    "Ineffective"
  ),
  include.lowest = TRUE
)

df[-1] just means "all but the first column of df". Use whatever subsetting necessary for your actual data.

That looks reasonable. So how does one apply that logic to the existing data.frame of numerics to produce an identical data.frame with strings as opposed to numerics? — Zach, Aug 23 '17 at 21:02

M-- · Answer 3 · 2017-08-23T21:23:41.857

You can also use findInterval() which is also a base R function:

myintervals <- c(-Inf, 0, 1.5, 2.5, 3.5, Inf)
mylabels    <- c(NA, "Highly Effective", "Generally Effective", 
                 "Marginally Effective", "Ineffective")

df[,-1] <- mylabels[sapply(df[,-1], function(x) findInterval(x,myintervals))]

df
##      BU               AUDIT                CORC                 GOV
## 1 Unit1 Generally Effective    Highly Effective Generally Effective
## 2 Unit2 Marginally Effectiv    Highly Effective Marginally Effectiv
## 3 Unit3 Marginally Effectiv Generally Effective Marginally Effectiv
## 4 Unit4 Marginally Effectiv Marginally Effectiv Generally Effective
## 5 Unit5 Generally Effective Marginally Effective Generally Effective
##                   PPS                TMSC               TRAIN
## 1         Ineffective Generally Effective Marginally Effectiv
## 2    Highly Effective Generally Effective Marginally Effectiv
## 3 Marginally Effectiv Marginally Effectiv Marginally Effectiv
## 4         Ineffective    Highly Effective Marginally Effectiv
## 5 Generally Effective Marginally Effective Marginally Effective

Data:

 df <- structure(list(BU = structure(1:5, .Label = c("Unit1", "Unit2",                   
     "Unit3", "Unit4", "Unit5"), class = "factor"), AUDIT = c(2, 3,                      
     2.5, 3, 2), CORC = c(0, 1.4, 2.4, 3.2, 2.8), GOV = c(2, 3.2,                        
     2.8, 1.6, 2), PPS = c(4, 1, 3, 4, 2), TMSC = c(1.5, 1.5, 2.75,                      
     1, 3), TRAIN = c(2.5, 3, 2.5, 3, 2.5)), .Names = c("BU", "AUDIT",                   
     "CORC", "GOV", "PPS", "TMSC", "TRAIN"), row.names = c(NA, 5L), class = "data.frame")

Note: I prefer findInterval() because it gives you an error if you have data outside of your defined boundaries so you would know about them rather than categorizing them as NA by default (what cut does).

How do those intervals correspond to the intervals I have posed in the question? c(-Inf, 0, 1.5, 2.5, 3.5, Inf) — Zach, Aug 24 '17 at 02:50

score 0 · Answer 4 · edited Jul 24 '20 at 23:11

You can use ifelse to change the numeric values to strings.

BU <- c("Unit1", "Unit2", "Unit3", "Unit4", "Unit5")
Audit <- c(2,3,2.5,3,2)
CORC <- c(0,1.4,2.4,3.2,2.8)
GOV <- c(2,3.2,2.8,1.6,2)

df <- data.frame(BU, Audit, CORC, GOV)
df$BU <- as.character(df$BU)
df$Audit <- as.numeric(as.character(df$Audit))
df$CORC <- as.numeric(as.character(df$CORC))
df$GOV <- as.numeric(as.character(df$GOV))

df[,-1] <- ifelse(df[,-1]>3.499, "Ineffective",
                  ifelse(df[,-1]>2.499 & df[,-1]<=3.499, "Marginally Effective",
                         ifelse(df[,-1]>1.5 & df[,-1]<=2.499, "Generally Effective",
                                "Highly Effective")))

> df
     BU                Audit                 CORC                  GOV
1 Unit1  Generally Effective     Highly Effective  Generally Effective
2 Unit2 Marginally Effective     Highly Effective Marginally Effective
3 Unit3 Marginally Effective  Generally Effective Marginally Effective
4 Unit4 Marginally Effective Marginally Effective  Generally Effective
5 Unit5  Generally Effective Marginally Effective  Generally Effective

If you want to collapse the whole table, you can add this code:

df[2:(NROW(df)+1),] <- df[1:NROW(df),]
df[1,] <- colnames(df)

new_df <- apply( df, 1 , paste , collapse = "," )

Output:

> new_df
                                                                    1 
                                                  "BU,Audit,CORC,GOV" 
                                                                    2 
     "Unit1,Generally Effective,Highly Effective,Generally Effective" 
                                                                    3 
   "Unit2,Marginally Effective,Highly Effective,Marginally Effective" 
                                                                    4 
"Unit3,Marginally Effective,Generally Effective,Marginally Effective" 
                                                                    5 
"Unit4,Marginally Effective,Marginally Effective,Generally Effective" 
                                                                    6 
 "Unit5,Generally Effective,Marginally Effective,Generally Effective"

Convert a data frame of numerics to string equivalents based on a reference range

4 Answers4