I am a beginner in R. I have an assignment to clean a data set that has more than 10,000 rows. My job is to analyze the accuracy of each participant and drop the participants who have low accuracy. Each participant answered 200 questions in this data set. There is a column for accuracy, in which "1" means right and "0" means wrong.


The sample of the dataset


This is what the data set looks like. There are more than 100 participants in this data set. I don't know which loop I can use for this. Here is what I have so far. If I don't use a loop, I will have to do this at least 100 times.

participant1 = dataset_name[dataset_name$Participant_ID == 1,] 
mean(participant1$Participant_accuracy)
alistaire
  • Does the assignment require you to use a loop? If not, you shouldn't use a loop for this type of task in R. – SymbolixAU Oct 04 '16 at 23:48
  • And, rather than posting an image of your data, you should edit your question with the output of `dput(head(df))` (where `df` is the name of your data). See [how to make a great reproducible example](http://stackoverflow.com/q/5963269/5977215) – SymbolixAU Oct 04 '16 at 23:49
  • `tapply(dataset_name$Participant_accuracy, dataset_name$Participant_ID, FUN=mean)` – jogo Oct 05 '16 at 07:23

1 Answer

I've generated some dummy data to help you along. As @SymbolixAU noted, it probably isn't necessary to use a for loop. We can use the aggregate and which functions, or we can use the dplyr package.

generate dummy data

I first create a data set that has a column for ID and a column for an Accuracy indicator. The probability of any row being accurate is 0.8.

set.seed(123)
df1 <- data.frame(ID = rep(1:10, each = 20),
                  Accuracy = rbinom(200, 1, prob = .8))

calculation

Then, we calculate the mean of the Accuracy column for each ID using the aggregate function.

df1.sum <- aggregate(Accuracy ~ ID, FUN = mean, data = df1)

#    ID Accuracy
# 1   1     0.70
# 2   2     0.80
# 3   3     0.90
# 4   4     0.85
# 5   5     0.85
# 6   6     0.70
# 7   7     0.80
# 8   8     0.90
# 9   9     0.90
# 10 10     0.75
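As jogo suggested in the comments, the same per-ID means can also be computed with tapply. This is only a sketch on the dummy data above; the name acc_by_id is made up for illustration. Note that tapply returns a named vector rather than a data frame, and the names are the IDs stored as character strings.

```r
# Dummy data, same setup as above (10 IDs, 20 rows each).
set.seed(123)
df1 <- data.frame(ID = rep(1:10, each = 20),
                  Accuracy = rbinom(200, 1, prob = .8))

# One mean per ID; the result is a vector named by ID.
acc_by_id <- tapply(df1$Accuracy, df1$ID, FUN = mean)

# Look up a single participant by its ID (as a character name).
acc_by_id["3"]

# IDs that fall below an assumed 80% cutoff (returned as character).
names(acc_by_id)[acc_by_id < 0.8]
```

This form is handy when you only need to look values up by ID; aggregate is more convenient when you want a data frame to merge or subset with.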

use calculation to subset the data

Using this result, we can select the IDs that pass (i.e. have Accuracy >= 80%). We can then use this list of IDs to subset our data.

pass_ids <- df1.sum[which(df1.sum$Accuracy >= .8), 1]
df1_pass <- df1[df1$ID %in% pass_ids, ]
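If you would rather skip the intermediate summary table, base R's ave function can attach each participant's mean to every row, so the filtering happens in one step. This is a sketch under the same assumptions as above (dummy data, 80% cutoff); id_mean and df1_pass_ave are illustrative names, not part of the original code.

```r
# Dummy data, same setup as above.
set.seed(123)
df1 <- data.frame(ID = rep(1:10, each = 20),
                  Accuracy = rbinom(200, 1, prob = .8))

# ave() returns a vector as long as the input: each row gets the
# mean Accuracy of the ID it belongs to.
id_mean <- ave(df1$Accuracy, df1$ID, FUN = mean)

# Keep only the rows of participants at or above the assumed cutoff.
df1_pass_ave <- df1[id_mean >= 0.8, ]

# Each retained participant should still have all 20 of its rows.
table(df1_pass_ave$ID)
```

Because the row-level vector lines up with the data frame, whole participants are kept or dropped together, which should give the same rows as the aggregate-based subset.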

dplyr

Alternatively, we can use the dplyr package.

library(dplyr)
df1_pass2 <- df1 %>%
    group_by(ID) %>%
    filter(mean(Accuracy) >= 0.8)
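One thing to keep in mind with the dplyr version is that the result is still grouped by ID, which can affect later operations. A hedged sketch of how you might drop the grouping and sanity-check the result, assuming dplyr is installed and using the same dummy data:

```r
library(dplyr)

# Dummy data, same setup as above.
set.seed(123)
df1 <- data.frame(ID = rep(1:10, each = 20),
                  Accuracy = rbinom(200, 1, prob = .8))

df1_pass2 <- df1 %>%
  group_by(ID) %>%
  filter(mean(Accuracy) >= 0.8) %>%
  ungroup()   # remove the grouping so later steps act on all rows

# Per-participant accuracy and row count of the retained IDs.
df1_pass2 %>%
  group_by(ID) %>%
  summarise(Accuracy = mean(Accuracy), n = n())
```

The summary at the end is only a check: every retained ID should show an accuracy of at least 0.8 and all 20 of its rows.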
bouncyball