0

I have a data.table consisting of anonymised final exam results for students. Each row is a student, each column a subject.

If a student took a subject they have a their resulting grade, and otherwise NA if they did not. I want to find the total number of subjects each student took.

The dt comes with this (num_subjects), but it's not coded to account for some subjects allowing study at 'double' intensity. Double intensity is denoted by having 2 grades recorded rather than just 1.

So I need to write a code that looks for instances of double grades like AA or AB or CC, and where found, adds +1 to a new variable for the number of subjects taken.

Example data: note there are actually 30+ potential subjects

subject1  subject2  subject3  num_subjects
AA        NA        NA        1
B         BB        C         3
NA        NA        A         1
NA        CC        D         2

Desired output

subject1  subject2  subject3  num_subjects  new_num_subjects
AA        NA        NA        1             2
B         BB        C         3             4
NA        NA        A         1             1
NA        CC        D         2             3

To do this for one subject, I'd just do:

dt[, new_num_subjects:= num_subjects] 
dt[
(subject1=="AA" | 
subject1=="AB" | 
subject1=="BB" | 
subject1=="BC" | 
subject1=="CC" | 
), new_num_subjects:= num_subjects+1] 

This works. But how to avoid writing this for all 30+ subjects?

Have tried making a vector of all subject names (subjectvec) and looping through it (see below), but this does not work:

for(i in subjectvec)
dt[
( (i) =="AA" | 
(i) =="AB" | 
(i) =="BB" | 
(i) =="BC" | 
(i) =="CC" | 
), new_num_subjects:= num_subjects+1] 
Samvel Aleqsanyan
  • 2,812
  • 4
  • 20
  • 28
newruser
  • 3
  • 1
  • 1
    If you format the data in long form, then `num_subjects = sum(nchar(grade))` should work. It would be easier to explain with reproducible data: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 – Frank Apr 19 '18 at 20:05
  • @Frank I think user wants something along the lines of `apply(xy[, -4], MARGIN = 1, FUN = function(x) sum(sapply(x, nchar), na.rm = TRUE))`. – Roman Luštrik Apr 19 '18 at 20:17
  • How do you put these commands into data.table syntax though? I tried: `dt[,charnum:= lapply(.SD, sum(nchar(subjectvec))), .SDcols = subjectvec]` But this returns: `'sum(nchar(post2014gcse))' is not a function, character or symbol` Any ideas to correct this? – newruser Apr 20 '18 at 16:33

3 Answers3

0

Here you go. You can define what columns to use in your calculation using [,1:3,5] as an example.

subject1 <- c("AA","B",NA,NA)
subject2 <- c(NA,"BB",NA,"CC")
subject3 <- c(NA,"C","A","D")
num_subjects <- c(1,3,1,2)
grades<-data.frame(subject1,subject2,subject3,num_subjects, stringsAsFactors = FALSE)

grades$new_num_subjects <-
  sapply(1:nrow(grades), function(x)
    sum(nchar(grades[x,1:3]),na.rm = TRUE)
    )

To see output

grades

  subject1 subject2 subject3 num_subjects new_num_subjects
1       AA     <NA>     <NA>            1                2
2        B       BB        C            3                4
3     <NA>     <NA>        A            1                1
4     <NA>       CC        D            2                3

Try this one to see if it is faster based on example by Roman Luštrik. lapply takes a list and I am not sure what data type those variables names you are using have. We answered your question based on the example provided. However if you provide more data, it will be easier for us to help

grades$new_num_subjects <- 
  apply(grades[, -4], MARGIN = 1, FUN = function(x) sum(sapply(x, nchar), na.rm = TRUE))
Kill3rbee Lee Mtoti
  • 246
  • 1
  • 2
  • 11
  • Thanks @Lee S. Is there an efficient way to do this in data.table? In my real data-set I've got >700k observations and >500 variables. It seems to be taking a long time to compute (still running, has been several minutes). – newruser Apr 20 '18 at 10:24
  • Actually, the command suggested doesn't seem to work. Thinking about it further, perhaps it is easiest to just create a variable that takes the value of the total number of characters across all subject grade variables. I thought this would look like as per below, but this doesn't work. `dt[,charnum:= lapply(.SD, sum(nchar(subjectvec))), .SDcols = subjectvec]` returns: `'sum(nchar(post2014gcse))' is not a function, character or symbol` Any ideas to correct this? – newruser Apr 20 '18 at 16:29
  • I know how to convert every grade into its number of characters: `dt[,(subjectvec):= lapply(.SD, nchar), .SDcols = subjectvec]` But am lost on how to instead just create a single additional column that takes the sum of all the character numbers across the `subjectvec` vector. – newruser Apr 20 '18 at 16:54
0

I run this for a very large simulated dataframe (1M rows) and it runs very quickly (despite not being a data.table approach):

# Dummy data
subject <- c("A", "AA", "B", "BB", "C", "CC", NA)
n <- 1000000
grades <- data.frame("s1" = sample(subject, n, replace = T), "s2" = sample(subject, n, replace = T), "s3" = sample(subject, n, replace = T))

# Count subjects
grades$new <- nchar(gsub("NA", "", paste0(grades[,1], grades[,2], grades[,3])))
rg255
  • 4,119
  • 3
  • 22
  • 40
0

Using data from Lee S, here are 2 data.table-based coding options

subject1 <- c("AA","B",NA,NA)
subject2 <- c(NA,"BB",NA,"CC")
subject3 <- c(NA,"C","A","D")
num_subjects <- c(1,3,1,2)

grades<-data.table(subject1,subject2,subject3,num_subjects)

## first option

# 1 paste subject1:subject3 together (but NA becomes character "NA" in paste'ing) with do.call
# 2 change all of those "NA"s to "" with gsub
# 3 find nchar

grades[,num_new := nchar(gsub("NA","",do.call(paste0,.SD))), .SDcols=subject1:subject3]

grades

#   subject1 subject2 subject3 num_subjects num_new
#1:       AA       NA       NA            1       2
#2:        B       BB        C            3       4
#3:       NA       NA        A            1       1
#4:       NA       CC        D            2       3


## Second option
rowSums_na.rm <- function(x) rowSums(x, na.rm = TRUE)
grades[,num_new := rowSums_na.rm(as.data.table(lapply(.SD, nchar))),  
           .SDcols=subject1:subject3]
JoeV
  • 16
  • 2