1

I have seen answers given here but I was wondering if in my case there might be an efficient way in BASE R to change my input data.frame to my desired output data.frame shown below?

input <- data.frame(id = c(1,3,6), school = LETTERS[1:3], read_1 = c(20,22,24),
               read_1_sp = c(T,F,T), read_2 =c(45,47,49),read_2_sp = c(F,F,F),  
               math_1 =c(20,22,NA), math_1_sp = c(T,F,NA), math_2 = c(NA,35,37),
               math_2_sp =c(NA,F,F))


output <- data.frame(id = c(rep(1,3),rep(3,4), rep(6, 3)),school = c(rep("A",3),rep("B",4), rep("C", 3)),   
                      subject = c("read","read","math","read","read","math", "math","read","read","math"),  
                      no.= c(1,2,1,1,2,1,2,1,2,2), score = c(20,45,20,22,47,22,35,24,49,37),    
                      sp = c(T,F,T,T,F,T,T,T,F,T))
rnorouzian
  • 7,397
  • 5
  • 27
  • 72
  • Have you tried `stats::reshape`? It's not the most intuitive to work with, but it does this reshaping you seek. – r2evans Apr 16 '20 at 20:29
  • Since you're trying to narrow several columns, I suspect it'll be several independent calls to `reshape` and then `merge`ing them together. – r2evans Apr 16 '20 at 20:31
  • IMO, this is the strong impetus for `tidyr::pivot_longer` ... while I appreciate your incentive to stick with base R, there are reasons convenience functions are such a ... convenience :-) – r2evans Apr 16 '20 at 20:34
  • Have you considered `sqldf` or some other sql-based solution? It's counter to your base-R constraint, but if you already use one of `sqldf`, `DBI`, `RSQLite`, or some other DBMS elsewhere in your workflow, you can capitalize on using that. – r2evans Apr 16 '20 at 20:57

2 Answers2

2

1) Base - reshape Create a list, varying, of two elements each of which is a character vector of names -- the first element is the vector of score names and the second is the vector of sp names. Use that with base reshape. Then sort by the idvar variables (or omit the two lines which perform the sorting if this is not required) and remove rows with NAs using na.omit. reshape produces a subject column with entries such as read_1. The transform statement splits that into two columns, subject and no.

varying <- lapply(c("\\d$", "sp$"), grep, names(input), value = TRUE)

r <- reshape(input, dir = "long", idvar = c("id", "school"), 
  varying = varying, v.names = c("score", "sp"),
  times = varying[[1]], timevar = "subject")  

o <- with(r, order(id, school))
r <- r[o, ]
r <- na.omit(r)

transform(r, subject = sub("_.*", "", subject), no = as.numeric(sub(".*_", "", subject)))

giving:

           id school subject score    sp no
1.A.read_1  1      A    read    20  TRUE  1
1.A.read_2  1      A    read    45 FALSE  2
1.A.math_1  1      A    math    20  TRUE  1
3.B.read_1  3      B    read    22 FALSE  1
3.B.read_2  3      B    read    47 FALSE  2
3.B.math_1  3      B    math    22 FALSE  1
3.B.math_2  3      B    math    35 FALSE  2
6.C.read_1  6      C    read    24  TRUE  1
6.C.read_2  6      C    read    49 FALSE  2
6.C.math_2  6      C    math    37 FALSE  2

2) data.table - melt The question asked for a base solution but just for comparison we also show a solution using melt in data.table.

Convert input to a data.table and melt it using the key and the indicated patterns. melt has no counterpart to the times= in reshape but rather provides index numbers in the variable.name column which is subject in this case. We use it to index into the times. This produces elements such as read_1 so we use fread to split subject into two columns, subject and no. Finally remove the rows having NAs using na.omit and sort by specifying key.

library(data.table)

input2 <- as.data.table(input, key = c("id", "school"))
times <- grep("\\d$", names(input2), value = TRUE)  # score col names

melt(input2, measure = patterns(sp = "sp", score = "\\d$"), variable.name = "subject")[, 
  c("subject", "no"):= fread(text = times[subject], sep = "_")][, 
  na.omit(.SD), key = key(input2)]

giving:

    id school    sp score subject no
 1:  1      A  TRUE    20    read  1
 2:  1      A FALSE    45    read  2
 3:  1      A  TRUE    20    math  1
 4:  3      B FALSE    22    read  1
 5:  3      B FALSE    47    read  2
 6:  3      B FALSE    22    math  1
 7:  3      B FALSE    35    math  2
 8:  6      C  TRUE    24    read  1
 9:  6      C FALSE    49    read  2
10:  6      C FALSE    37    math  2
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
1

In tidyverse, we can do

library(tidyr)
library(stringr)
input %>%
   rename_at(vars(matches('\\d+$')), ~ str_c(., '_score')) %>%
   pivot_longer(cols = -c(id, school), names_to = c('subject', 'no', '.value'),
          names_sep="_", values_drop_na = TRUE)
# A tibble: 10 x 6
#      id school subject no    score sp   
#   <dbl> <fct>  <chr>   <chr> <dbl> <lgl>
# 1     1 A      read    1        20 TRUE 
# 2     1 A      read    2        45 FALSE
# 3     1 A      math    1        20 TRUE 
# 4     3 B      read    1        22 FALSE
# 5     3 B      read    2        47 FALSE
# 6     3 B      math    1        22 FALSE
# 7     3 B      math    2        35 FALSE
# 8     6 C      read    1        24 TRUE 
# 9     6 C      read    2        49 FALSE
#10     6 C      math    2        37 FALSE

Or an option with base R

 i1 <-grep("_\\d+$", names(input))
 names(input)[i1] <- paste0(names(input)[i1], "_score")
 lst1 <- lapply(split.default(input[-(1:2)], sub(".*_", "", names(input)[-(1:2)])), function(dat) {
         dat1 <- stack(dat)
          transform(dat1, no = sub("^[^_]+_(\\d+)_.*", "\\1", ind), ind = sub("_.*", "", ind))})
out <- setNames(cbind(lst1[[1]], lst1[[2]][1])[c(2:3, 1, 4)], c('subject', 'no', 'score', 'sp'))
na.omit(cbind(input[rep(seq_len(nrow(input)), each = 4), 1:2], out))
akrun
  • 874,273
  • 37
  • 540
  • 662