
I have a multilevel structure, and what I need to do is standardize for each individual (the higher-level unit, each having several separate measures).

Consider:

  ID measure score
1  1       1     5
2  1       2     7
3  1       3     3
4  2       1    10
5  2       2     5
6  2       3     3
7  3       1     4
8  3       2     1
9  3       3     1

I used apply(data, 2, scale) to standardize for everyone (this also standardizes the ID and measure, but that is alright).

However, how do I standardize separately for ID == 1, ID == 2 and ID == 3? That is: from each observation, subtract the mean of that ID's 3 scores and divide by the standard deviation of those 3 scores.

I was considering a for loop, but the problem is that I want to bootstrap this (in other words, replicate the whole procedure 1000 times on a big dataset), so speed is VERY important.

Extra information: the IDs can have a variable number of measurements, so it is not the case that they all have exactly 3 measured scores.

The dput of the data is:

structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), measure = c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), score = c(5L, 7L, 3L, 10L, 5L, 
3L, 4L, 1L, 1L)), .Names = c("ID", "measure", "score"), class = "data.frame", row.names = c(NA, 
-9L))
  • Look at package `plyr` (function `ddply`). – Roland Apr 15 '13 at 11:10
  • 1
    Please give sample data or reproducible example so that good people here can help you better. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – CHP Apr 15 '13 at 11:13
  • 1
    @geektrader This sample data suffices in my opinion. – PascalVKooten Apr 15 '13 at 11:14
  • @Dualinity Sufficient or not, it is a good practice to give your data in a form which can be easily pasted into R locally. Currently one needs to retype your data from scratch instead of pasting. It is easier when you do this: mydata <- data.frame(x1=..., x2=...) – Maxim.K Apr 15 '13 at 11:23
  • 1
    I think the piece of data illustrates good enough the problem, I edited the question according to the previous comments. – Jilber Urbina Apr 15 '13 at 11:27
  • @Jilber, I don't think it's unfair to put the burden on the OP to provide a *reproducible example*. I've removed my down-vote after your edit. – Arun Apr 15 '13 at 11:29
  • @MaximKovalenko I agree with you. Next time I'll supply the code to be able to reproduce it. – PascalVKooten Apr 15 '13 at 11:53

1 Answer


Here's a solution using lapply with split, assuming your data is in DF:

> lapply(split(DF[,-1], DF[,1]), function(x) apply(x, 2, scale))
$`1`
     measure score
[1,]      -1     0
[2,]       0     1
[3,]       1    -1

$`2`
     measure      score
[1,]      -1  1.1094004
[2,]       0 -0.2773501
[3,]       1 -0.8320503

$`3`
     measure      score
[1,]      -1  1.1547005
[2,]       0 -0.5773503
[3,]       1 -0.5773503

An alternative which produces the same result is:

> simplify2array(lapply(split(DF[,-1], DF[,1]), scale))

This alternative avoids using apply inside the lapply call.

Here, split divides the data into groups defined by ID and returns a list, so you can use lapply to loop over each element of that list, applying scale to it.
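Note that the split/lapply route returns a list of matrices rather than a vector aligned with the original rows. If you want the standardized scores back in the original row order (ready to assign into the data frame), a base-R sketch using ave() does that directly; the column name `z` below is just an illustrative choice:

```r
# Rebuild the example data
DF <- read.table(text="ID measure score
1 1 5
1 2 7
1 3 3
2 1 10
2 2 5
2 3 3
3 1 4
3 2 1
3 3 1", header=TRUE)

# ave() applies FUN within each group of ID and returns a vector in
# the ORIGINAL row order, so it can be assigned straight back.
# as.numeric() drops the matrix attributes that scale() attaches.
DF$z <- ave(DF$score, DF$ID, FUN = function(v) as.numeric(scale(v)))

DF$z[1:3]  # ID 1: scale(c(5, 7, 3)) -> 0, 1, -1
```

This also handles IDs with differing numbers of measurements, since each group is scaled with its own length.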

Using ddply from plyr as @Roland suggests:

> library(plyr)
> ddply(DF, .(ID), numcolwise(scale))
  ID measure      score
1  1      -1  0.0000000
2  1       0  1.0000000
3  1       1 -1.0000000
4  2      -1  1.1094004
5  2       0 -0.2773501
6  2       1 -0.8320503
7  3      -1  1.1547005
8  3       0 -0.5773503
9  3       1 -0.5773503

Importing your data (this answers the last comment):

DF <- read.table(text="  ID measure score
1  1       1     5
2  1       2     7
3  1       3     3
4  2       1    10
5  2       2     5
6  2       3     3
7  3       1     4
8  3       2     1
9  3       3     1", header=TRUE)
  • By the way, thanks for showing how easy it is to import it like that! – PascalVKooten Apr 15 '13 at 11:18
  • If you add a little bit of explanation to it, I'll accept and upvote the answer (I'm not really sure yet what both functions do, while I'd be able to read it, I guess it improves the answer quality) – PascalVKooten Apr 15 '13 at 11:19
  • Btw, check the first part. The ddply solution seems good. For first ID: `scale(c(5,7,3))` -> `(0, 1, -1)`, but I cannot find these three values in that order in the first solution? – PascalVKooten Apr 15 '13 at 11:25
  • @Dualinity you're absolutely right, I edited the answer giving some explanation and fixing the mistake. – Jilber Urbina Apr 15 '13 at 11:29
  • Btw as a benchmark, the first solution is around 2 times faster (based on 1000 repetitions). Thank you. Still, the ordering is not good yet in that solution? – PascalVKooten Apr 15 '13 at 11:31
  • 1
    If speed matters and your dataset is big or you have many IDs, `data.table` might be the way to go. – Roland Apr 15 '13 at 11:35
  • It roughly takes 1 second for running ddply once on the dataset. I guess that'll have to be acceptable. – PascalVKooten Apr 15 '13 at 11:47
  • @Jilber The lapply solution doesn't really yield a ready-to-go object. I was not able to get it to show up in the way `ddply` has it. I am supposing that transferring that solution into matrix format will at least take as much time as the ddply solution. – PascalVKooten Apr 15 '13 at 11:48
  • @Jilber how again did you do the amazing copy paste import for which you just copied the text as a string? – PascalVKooten Apr 25 '13 at 08:15
  • @Dualinity What do you mean? Is it what I've just put in the edit? If so, I just copied and pasted your DF into `read.table(text="PASTE-HERE", header=TRUE)` – Jilber Urbina Apr 25 '13 at 17:17