
I have a multilevel structure, and what I need to do is standardize for each individual (the higher-level unit, each having several separate measures).

Consider:

  ID measure score
1  1       1     5
2  1       2     7
3  1       3     3
4  2       1    10
5  2       2     5
6  2       3     3
7  3       1     4
8  3       2     1
9  3       3     1

I used apply(data, 2, scale) to standardize for everyone (this also standardizes the ID and measure, but that is alright).

However, how do I standardize separately for ID == 1, ID == 2 and ID == 3? That is: from each observation, subtract the mean of that ID's 3 scores and divide by the standard deviation of those 3 scores.

I was considering a for loop, but the problem is that I want to bootstrap this (in other words, replicate the whole procedure 1000 times on a big dataset), so speed is VERY important.

Extra information: the IDs can have a variable number of measurements, so it is not the case that they all have exactly 3 measured scores.

The dput of the data is:

structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), measure = c(1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), score = c(5L, 7L, 3L, 10L, 5L, 
3L, 4L, 1L, 1L)), .Names = c("ID", "measure", "score"), class = "data.frame", row.names = c(NA, 
-9L))
  • Look at package `plyr` (function `ddply`). – Roland Apr 15 '13 at 11:10
  • 1
    Please give sample data or reproducible example so that good people here can help you better. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – CHP Apr 15 '13 at 11:13
  • 1
    @geektrader This sample data suffices in my opinion. – PascalVKooten Apr 15 '13 at 11:14
  • @Dualinity Sufficient or not, it is a good practice to give your data in a form which can be easily pasted into R locally. Currently one needs to retype your data from scratch instead of pasting. It is easier when you do this: mydata <- data.frame(x1=..., x2=...) – Maxim.K Apr 15 '13 at 11:23
  • 1
    I think the piece of data illustrates good enough the problem, I edited the question according to the previous comments. – Jilber Urbina Apr 15 '13 at 11:27
  • @Jilber, I don't think it's unfair to put the burden on the OP to provide a *reproducible example*. I've removed my down-vote after your edit. – Arun Apr 15 '13 at 11:29
  • @MaximKovalenko I agree with you. Next time I'll supply the code to be able to reproduce it. – PascalVKooten Apr 15 '13 at 11:53

1 Answer


Here's a solution using lapply with split, assuming your data is in DF:

> lapply(split(DF[,-1], DF[,1]), function(x) apply(x, 2, scale))
$`1`
     measure score
[1,]      -1     0
[2,]       0     1
[3,]       1    -1

$`2`
     measure      score
[1,]      -1  1.1094004
[2,]       0 -0.2773501
[3,]       1 -0.8320503

$`3`
     measure      score
[1,]      -1  1.1547005
[2,]       0 -0.5773503
[3,]       1 -0.5773503

An alternative which produces the same result is:

> simplify2array(lapply(split(DF[,-1], DF[,1]), scale))

This alternative avoids using apply inside the lapply call.

Here, split divides the data into groups defined by ID and returns a list, so you can use lapply to loop over each element of that list, applying scale to it.
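Note that the split/lapply route returns a list of matrices rather than a vector aligned with the original rows. If you want the standardized scores back in the original row order (ready to assign into the data frame), a base-R sketch using ave() does that directly; the column name `z` below is just an illustrative choice:

```r
# Rebuild the example data
DF <- read.table(text="ID measure score
1 1 5
1 2 7
1 3 3
2 1 10
2 2 5
2 3 3
3 1 4
3 2 1
3 3 1", header=TRUE)

# ave() applies FUN within each group of ID and returns a vector in
# the ORIGINAL row order, so it can be assigned straight back.
# as.numeric() drops the matrix attributes that scale() attaches.
DF$z <- ave(DF$score, DF$ID, FUN = function(v) as.numeric(scale(v)))

DF$z[1:3]  # ID 1: scale(c(5, 7, 3)) -> 0, 1, -1
```

This also handles IDs with differing numbers of measurements, since each group is scaled with its own length.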

Using ddply from plyr as @Roland suggests:

> library(plyr)
> ddply(DF, .(ID), numcolwise(scale))
  ID measure      score
1  1      -1  0.0000000
2  1       0  1.0000000
3  1       1 -1.0000000
4  2      -1  1.1094004
5  2       0 -0.2773501
6  2       1 -0.8320503
7  3      -1  1.1547005
8  3       0 -0.5773503
9  3       1 -0.5773503

Importing your data (this answers the last comment):

DF <- read.table(text="  ID measure score
1  1       1     5
2  1       2     7
3  1       3     3
4  2       1    10
5  2       2     5
6  2       3     3
7  3       1     4
8  3       2     1
9  3       3     1", header=TRUE)
  • By the way, thanks for showing how easy it is to import it like that! – PascalVKooten Apr 15 '13 at 11:18
  • If you add a little bit of explanation to it, I'll accept and upvote the answer (I'm not really sure yet what both functions do, while I'd be able to read it, I guess it improves the answer quality) – PascalVKooten Apr 15 '13 at 11:19
  • Btw, check the first part. The ddply solution seems good. For first ID: `scale(c(5,7,3))` -> `(0, 1, -1)`, but I cannot find these three values in that order in the first solution? – PascalVKooten Apr 15 '13 at 11:25
  • @Dualinity you're absolutely right, I edited the answer giving some explanation and fixing the mistake. – Jilber Urbina Apr 15 '13 at 11:29
  • Btw as a benchmark, the first solution is around 2 times faster (based on 1000 repetitions). Thank you. Still, the ordering is not good yet in that solution? – PascalVKooten Apr 15 '13 at 11:31
  • 1
    If speed matters and your dataset is big or you have many IDs, `data.table` might be the way to go. – Roland Apr 15 '13 at 11:35
  • It roughly takes 1 second for running ddply once on the dataset. I guess that'll have to be acceptable. – PascalVKooten Apr 15 '13 at 11:47
  • @Jilber The lapply solution doesn't really yield a ready-to-go object. I was not able to get it to show up in the way `ddply` has it. I am supposing that transferring that solution into matrix format will at least take as much time as the ddply solution. – PascalVKooten Apr 15 '13 at 11:48
  • @Jilber how again did you do the amazing copy paste import for which you just copied the text as a string? – PascalVKooten Apr 25 '13 at 08:15
  • @Dualinity What do you mean? Is it what I've just put in the edit? If so, I just copied and pasted your DF into `read.table(text="PASTE-HERE", header=TRUE)` – Jilber Urbina Apr 25 '13 at 17:17