7

I am working on a research project, and one of the tables is entered in a way that is not quite suitable for analysis yet, so I am trying to reorganize it. Currently, each row is a test-taker, and each column is a question they have answered wrongly, entered in ascending order. So, for the first row, entries may read "Q1" "Q3" "Q9" etc. for the first, second and third columns respectively. There are 25 questions in total.

My goal is to reorganize the data such that there is a column for every question. If a test-taker has answered the question correctly, the entry for the respective column reads 1, and 0 if otherwise.

There is a brute-force way that appears to work. It is possible to mutate each column separately and check for each value in each column. However, there are 25 questions, and all this typing seems extremely inefficient, so I suspect there must be a better way.

The brute-force code looks something like:

df %>%
  mutate(Q3 == ifelse(col1 == "Q3" | col2 == "Q3" | col3 == "Q3", 0, 1))

Here, col1, col2, col3 are all columns that might contain Q3, which might be a question the test-taker got wrong. If any of them do, we enter 0. Else, we enter 1.

With 25 questions, the code becomes too long.

EDIT: A sample of the dataframe looks like this.

sample <- "ID   Col1  Col2  Col3  Col4
1          100   Q1     
2          101   Q3    Q4
3          102   Q2    Q3    Q4   
4          103   
5          104   Q4
6          105   Q1    Q2    Q3    Q4 "

The desired output is as follows:

sample <- "ID    Q1    Q2    Q3    Q4
1          100   0     1     1     1
2          101   1     1     0     0
3          102   1     0     0     0   
4          103   1     1     1     1
5          104   1     1     1     0 
6          105   0     0     0     0 "
M--
  • 25,431
  • 8
  • 61
  • 93
  • 1
    Can you share a sample of your input data (not all columns and not all rows) and a desired output based on that. Read this thread to know more about asking questions with reproducible examples: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – M-- Sep 18 '19 at 15:41

5 Answers5

3

this was my solution- turning the data from wide to long and back again

s <- reshape2::melt(sample, id.vars = "ID", value.name = "Q")
s$variable <- 1
s <- subset(s, complete.cases(s))
s <- reshape(s, idvar = "ID", timevar = "Q", direction = "wide")
s <- apply(s, 2, function(x) ifelse(is.na(x), 0, x))
m070ch
  • 56
  • 3
2

1) Assuming DF is as shown reproducibly in the Note at the end use sapply to create an matrix of indicators and then cbind it to the ID column. Finally make the names nicer. No packages are used.

ques <- function(i) paste0("Q", 1:25) %in% unlist(DF[i, -1])
DFout <- cbind(DF[1], +t(sapply(1:nrow(DF), ques)))
names(DFout)[-1] <- paste0("Q", names(DFout[-1]))

The first 5 columns are:

> DFout[1:5]

   ID Q1 Q2 Q3 Q4
1 100  1  0  0  0
2 101  0  0  1  1
3 102  0  1  1  1
4 103  0  0  0  0
5 104  0  0  0  1
6 105  1  1  1  1

2) Another possibility is to convert the input to long form and then use xtabs to create a table from it.

library(dplyr)
library(tidyr)

tab <- DF %>% 
  gather(key, Question, -ID) %>%
  filter(nzchar(Question)) %>%
  mutate(Question = factor(Question, paste0("Q", 1:25))) %>%
  xtabs(~ ID + Question, .)

giving this table. We show the first 5 columns:

> tab[, 1:5]

     Question
ID    Q1 Q2 Q3 Q4
  100  1  0  0  0
  101  0  0  1  1
  102  0  1  1  1
  104  0  0  0  1
  105  1  1  1  1

If it is important that the result be a data frame then add:

library(tibble)

tab %>% 
  as.data.frame.matrix %>% 
  rownames_to_column(var = "ID")    

Note

sample <- "rows ID   Col1  Col2  Col3  Col4
1          100   Q1     
2          101   Q3    Q4
3          102   Q2    Q3    Q4   
4          103   
5          104   Q4
6          105   Q1    Q2    Q3    Q4"
DF <- read.table(text = sample, header = TRUE, fill = TRUE, as.is = TRUE,
  strip.white = TRUE)[-1]
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
2

@G.Grothendieck provides a really nice solution. Here's a variation on that answer that will produce a value for every question in the test, even if it was answered correctly by every student. Admittedly, it's a little less elegant. Also note that I construct the data with missing values instead of empty strings, so the filter is a little different

dat <- data.frame(ID = c(100:105), 
                  Col1 = c("Q1", "Q3", "Q2", NA, "Q4", "Q1"), 
                  Col2 = c(NA, "Q4", "Q3", NA, NA, "Q2"), 
                  Col3 = c(NA, NA, "Q4", NA, NA, "Q3"), 
                  Col4 = c(NA, NA, NA, NA, NA, "Q4"), 
                  stringsAsFactors = FALSE)

dat %>%
  gather(key = col, val = wrong, -ID) %>%
  select(-col) %>%
  mutate(tmp = 1) %>%
  complete(wrong = paste0("Q", 1:25)) %>%
  filter(!is.na(wrong)) %>%
  spread(wrong, tmp, fill = 0) %>%
  select(ID, paste0("Q", 1:25)) %>%
  filter(!is.na(ID)) %>%
  data.frame
Brendan A.
  • 1,268
  • 11
  • 16
2

Here's a convert-to-long-first approach similar to the others, but with data.table

library(data.table)
setDT(df)

dcast(melt(df, 'ID'), ID ~ value, fun.aggregate = length)[, V1 := NULL][]
#     ID Q1 Q2 Q3 Q4
# 1: 100  1  0  0  0
# 2: 101  0  0  1  1
# 3: 102  0  1  1  1
# 4: 104  0  0  0  1
# 5: 105  1  1  1  1
IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
0

This would be an approach using a simple for loop.

Let's take this data from above:

sample <- "rows ID   Col1  Col2  Col3  Col4
1          100   Q1     
2          101   Q3    Q4
3          102   Q2    Q3    Q4   
4          103   
5          104   Q4
6          105   Q1    Q2    Q3    Q4"
DF <- read.table(text = sample, header = TRUE, fill = TRUE, as.is = TRUE,
                 strip.white = TRUE)[-1]

Here is the approach. It fills the existing data frame to double-check whether recoding went well:

vars <- paste0("Q", 1:4)

for (i in vars){
  DF[i] = rowSums(ifelse(DF[, grep( "Col", names(DF))]==i, 1, 0))
}  

Afterwards unneeded columns can be delete:

DF <- DF[, -grep( "Col", names(DF))]
TimTeaFan
  • 17,549
  • 4
  • 18
  • 39