1

Students are being tested on multiple choice questions (MCQ) by using Zoom polling. There are 40 students in the class, and there are 30 MCQs. I don't really know how the polling works, but the course professor told me that Zoom will produce 30 excel csv files - one for each MCQ. Each file will have the name of the student, their ID, and either 1 or 0 points for the MCQ answer.

My job is to find a way to take all these 30 csv files (which I have kept in a folder), and merge them to find the total score for each student. So, I should be able to tell "How much did Kate score out of 30?".

Each csv file (which will be named MCQ1, MCQ2, ...) will look something like this:

Name   SID   Score
Kate   534     1
Bret   411     0
Jade   325     1

I have to do this using R. This is what I need:

Name   SID    MCQ1  MCQ2  MCQ3  MCQ4  MCQ5  ...  Total
Kate   534     1     1      0     1    0    ...  (=sum) 
Bret   411     0     0      0     1    1    ...  (=sum)
Jade   325     1     0      1     1    0    ...  (=sum)

What needs to happen is R needs to find the MCQ scores for each student using "Name" or "SID" and stack only the scores one after another.

I have looked at the following posts here in SO.

  1. Importing multiple Excel files with filenames in R
  2. How can I read multiple (excel) files into R? [duplicate]
  3. How to import multiple .csv files at once?

Here is what I have so far:

total_score <- list.files(pattern = "*.csv") %>% map_df(read_csv)

Name   SID   Score
Kate   534     1
Bret   411     0
Jade   325     1
Kate   534     1
Bret   411     0
Jade   325     0
Kate   534     0
Bret   411     0
Jade   325     1
Kate   534     1
Bret   411     1
Jade   325     1
Kate   534     0
Bret   411     1
Jade   325     0

How do I tell R not to repeat the first two columns, and to only take each "score" column from the files for each specific student and then put them as new columns named after the csv files? Can someone point me in the right direction?

Pineapple
  • 193
  • 8

2 Answers2

1

We can rename the 'Score' column to the filename by extracting the filename with basename, then loop over the 'files', read the data with read_csv, rename the 'Score' with the filename ('nm1'), reduce it to a single dataset by joining the datasets by 'Name', 'SID' and then create the 'Total' column by doing the rowSums of columns that start with 'MCQ'

library(readr)
library(dplyr)
library(purrr)
files <- list.files(pattern = "*.csv")
nm1 <- tools::file_path_sans_ext(basename(files))
imap(setNames(files, nm1), 
        ~ {
          nm <- .y
          read_csv(.x) %>%
                  rename_at(vars(Score), ~ nm)
    }) %>%
   reduce(full_join, by = c('Name', 'SID')) %>%
   mutate(Total = rowSums(select(., starts_with('MCQ'))))

-output

# A tibble: 3 x 8
#  Name    SID  MCQ1  MCQ2  MCQ3  MCQ4  MCQ5 Total
#  <chr> <int> <int> <int> <int> <int> <int> <int>
#1 Bret    411     0     0     0     1     1     2
#2 Jade    325     1     0     1     1     0     3
#3 Kate    534     1     1     0     1     0     3

Or an option with data.table

library(data.table)
Reduce(function(x, y) x[y, on = .(Name, SID)], 
  Map(function(x, y) setnames(fread(x), 'Score', y),
       files, nm1))[, Total := rowSums(.SD, na.rm = TRUE), 
             .SDcols = patterns("^MCQ")][]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Woooh, this worked so well. Thanks akrun! I have one question though. Let's say the names of the students are not in the same order in every csv file (for example, Bret in MCQ1 is in the first row, and in MCQ2 in the 17th row). Will your code still be valid then? Or is some change needed? – Pineapple Sep 26 '20 at 00:39
  • 1
    @AIQ The order wouldn't matter as we are doing a join that matches the 'Name' and "SID" in all the datasets – akrun Sep 26 '20 at 21:18
0

Try this base R option with a built-in function to load and format names. The main idea of the function is to load the files based on the list of files then assign the name of the file to the score variable. After that using lapply(), Reduce() and merge() we compact the data in order to obtain the total variable with rowSums() (we exclude the first two variables as they are name and number of student). I have tested with some dummy files based on the output you shared. Here the code:

#Obtain files
v1 <- list.files(pattern = '*.csv')
#Function for load
myload <- function(x)
{
  #Name for var
  vname <- gsub('.csv','',x)
  #Load data
  y <- read.csv(x,stringsAsFactors = F)
  #Rename the score var
  names(y)[3] <- vname
  #Return
  return(y)
}
#Apply
List <- lapply(v1,myload)
#Now merge
df <- Reduce(function(x, y) merge(x, y, by=c('Name','SID'),all=TRUE), List)
#Compute total
df$Total <- rowSums(df[,-c(1,2)],na.rm=T)

Output:

  Name SID MCQ1 MCQ2 MCQ3 MCQ4 MCQ5 Total
1 Bret 411    0    0    0    1    1     2
2 Jade 325    1    0    1    1    0     3
3 Kate 534    1    1    0    1    0     3
Duck
  • 39,058
  • 13
  • 42
  • 84
  • Hello Duck, thanks for the answer. The code at the step "#Now merge" gives an error: *Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column*. Any ideas? – Pineapple Sep 26 '20 at 00:36
  • @AIQ Maybe one of your csv files does not have the required variables! – Duck Sep 26 '20 at 00:42
  • I checked, my csv files all have 3 columns and 6 rows (including the header), they all match. – Pineapple Sep 26 '20 at 00:56
  • @AIQ The only thing I think is that some name can be different and that is producing the error. – Duck Sep 26 '20 at 00:58
  • @AIQ Also try adding `stringsAsFactors=F` in the `read.csv()` that is located inside the reading function or explore if all names in List are equal using `lapply(List,names)` – Duck Sep 26 '20 at 01:03