0

I'm quite a newbie in R so I was interested in the optimality of my solution. Even if it works it could be (a bit) long and I wanted your advice to see if the "way I solved it" is "the best" and it could help me to learn new techniques and functions in R.

I have a dataset on students identified by their id and I have the school where they are matched and the score they obtained at a specific test (so for short: 3 variables id,match and score).

I need to construct the following table: for students in between two percentiles of score, I need to calculate the average score (between students) of the average score of the students of the school they are matched to (so for each school I take the average score of the students matched to it and then I calculate the average of this average for percentile classes, yes average of a school could appear twice in this calculation). In English it allows me to answer: "A student belonging to the x-th percentile in terms of score will be in average matched to a school with this average quality".

Here is an example in the picture:

Example

So in that case, if I take the median (15) for the split (rather than percentiles) I would like to obtain:

[0,15]  : 9.5

(15,24] : 20.25

So for students having a score between 0 and 15 I take the average of the average score of the school they are matched to (note that b average will appears twice but that's ok).

Here how I did it:

match <- c(a,b,a,b,c)
score <- c(18,4,15,8,24)
scoreQuant <- cut(score,quantile(score,probs=seq(0,1,0.1),na.rm=TRUE))
AvgeSchScore <- tapply(score,match,mean,na.rm=TRUE)
AvgScore <- 0
for(i in 1:length(score)) {
       AvgScore[i] <- AvgeSchScore[match[i]]
}
results <- tapply(AvgScore,scoreQuant,mean,na.rm = TRUE)

If you have a more direct way of doing it.. Or I think the bad point is 3) using a loop, maybe apply() is better ? But I'm not sure how to use it here (I tried to code my own function but it crashed so I "bruted force it").

Thanks :)

989
  • 12,579
  • 5
  • 31
  • 53
Yurienu
  • 95
  • 7
  • Your "Here's how I did it" section should be actual code examples. You can easily use the example with the picture you linked to. Writing it out puts the responsibility on the reader to decipher and mind-read what you are talking about. If you need help asking the question read http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Pierre L Oct 07 '15 at 22:25
  • why is "Match" capitalized in the loop? – Pierre L Oct 07 '15 at 22:39
  • I'm really sorry. I just edited everything correctly I think. Sorry again – Yurienu Oct 07 '15 at 22:40
  • I re-edited it, you're perfectly right. The loop takes... time. – Yurienu Oct 07 '15 at 22:41
  • Your loop can be replaced with `AvgScore <- AvgeSchScore[match]` – Pierre L Oct 07 '15 at 22:42
  • Hmm.. Are you sure ? AvgeSchScore and Match do not have the same size since several students may be matched to the same school. I want AvgScore to give for each student the average of the school he is matched to (so all the students in the same school will have the same value for this variable) – Yurienu Oct 07 '15 at 22:46
  • Compare the two outputs and verify that they are the same. – Pierre L Oct 07 '15 at 22:48
  • You're right my apologies.. So I'll meditate on what happens in the code :) I thought that dummies or coordinates must have the same size than the original matrix. – Yurienu Oct 07 '15 at 22:54
  • No you are subsetting by names. The tapply function assigns names to the output. You are using those names to match the column. I am writing a data.table solution that might speed things up. – Pierre L Oct 07 '15 at 22:59
  • I see ! Thanks ! It's already super quick on the original dataset but I would be very interested to have an even more efficient version to learn the technique ! – Yurienu Oct 07 '15 at 23:02

1 Answers1

0

The main fix is to eliminate the for loop with:

AvgScore <- AvgeSchScore[match]

R allows you to subset in ways that you cannot in other languages. The tapply function outputs the names of the factor that you grouped by. We are using those names for match to subset AvgeScore.

data.table

If you would like to try data.table you may see speed improvements.

library(data.table)
match <- c("a","b","a","b","c")
score <- c(18,4,15,8,24)
dt <- data.table(id=1:5, match, score)
scoreQuant <- cut(dt$score,quantile(dt$score,probs=seq(0,1,0.1),na.rm=TRUE))
dt[, AvgeScore := mean(score), match][, mean(AvgeScore), scoreQuant]
#    scoreQuant   V1
#1: (17.4,19.2] 16.5
#2:          NA  6.0
#3:   (12.2,15] 16.5
#4:   (7.2,9.4]  6.0
#5:   (21.6,24] 24.0

It may be faster than base R. If the value in the NA row bothers you, you can delete it after.

Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • I tried the data.table. And I get.. only NA. I typed what you proposed (changing the names corresponding to the ones of the original database). So I converted my data.frame into data.table, it worked well. The quantile part also. Then if I type: `dt[, AvgeScore := mean(score), match]` I got a lot of NA but I don't see why 'cause I checked the score of the students for a school with a NA and they are not NA. By the way my match variable has the form "55F999" (2 numbers, one letter, 3 numbers) – Yurienu Oct 08 '15 at 15:14
  • Could you post the first 5 or 6 lines from your real data and I will try it? Add it to your question. – Pierre L Oct 08 '15 at 15:17
  • I tried with the head() of my original databse but the code works well (since for the 6 first or last usually there is only one student per school). But I figured out what happened: the averages that are NA are those for which at least one student as NA for the score (so NA seems to "propagate" in the calculations) so I added na.rm= TRUE in the mean() and it worked. Sorry for this newbie mistake :) – Yurienu Oct 08 '15 at 15:36