1

I have a large data base looking like this:

      >tms
    expId     id     date sessionNr waveSh ipi isi perc    ampl qual eventNr
1   b80M1 myrthe 20131206         1      2  20   1   80  416.10    1     145
2   b80M1 myrthe 20131206         1      2   4   2   80  366.80    1     146
3   b80M1 myrthe 20131206         1      2   4   3   80  411.60    1     147
    ..... ...... ........         .      .   .   .   ..  ......    .     ... 
    ..... ...... ........         .      .   .   .   ..  ......    .     ...
24  m80M1 myrthe 20131218         1      1  20   2   80   58.10    1     266
25  m80M1 myrthe 20131218         1      1   4   1   80   22.60    0     267
26  m80M1 myrthe 20131218         1      1   4   3   80   21.90    0     268
    ..... ...... ........         .      .   .   .   ..  ......    .     ...
    ..... ...... ........         .      .   .   .   ..  ......    .     ...
201 h80M1 myrthe 20131219         1      3   5   3   80   33.00    0     194
202 h80M1 myrthe 20131219         1      3   6   1   80   52.50    1     195
203 h80M1 myrthe 20131219         1      3   4   4   80  314.20    1     196

Within every tms$expId I would like to create a new variable called tms$norm. This variable represents the ratio between the tms$ampl and the mean of tms$isi==1 within an tms$expId which is tms$ampl/mean(tms[tms$isi==1,]$ampl).

I could take the long run like this, manually subsetting for every tms$expId:

    b80L1 <- tms[tms$expId==b80L1,]
    attach(b80L1)
    b80L1$norm <- b80L1$ampl/mean(b80L1[b80L1$isi==1,]$ampl)
    detach(b80l1)

    m80M1 <- tms[tms$expId==m80M1,]
    attach(m80M1)
    M80M1$norm <- M80M1$ampl/mean(m80M1[m80M1$isi==1,]$ampl)
    detach(m80M1)

    h80M1 <- h80M1[h80M1$expId==h80M1,]
    attach(h80M1)
    h80M1$norm <- h80M1$ampl/mean(h80M1[h80M1$isi==1,]$ampl)
    detach(h80M1)

And then combine all subsets again in one data frame like this:

    tmsNorm <- rbind(b80L1,m80M1,h80M1)

Then the tmsNorm database would look like this:

      >tmsNorm
    expId     id     date sessionNr waveSh ipi isi perc    ampl qual eventNr  norm
1   b80M1 myrthe 20131206         1      2  20   1   80  416.10    1     145  0.6547
2   b80M1 myrthe 20131206         1      2   4   2   80  366.80    1     146  0.5667
3   b80M1 myrthe 20131206         1      2   4   3   80  411.60    1     147  0.6530
    ..... ...... ........         .      .   .   .   ..  ......    .     ...  ...
    ..... ...... ........         .      .   .   .   ..  ......    .     ...  ...
24  m80M1 myrthe 20131218         1      1  20   2   80   58.10    1     266  0.0123
25  m80M1 myrthe 20131218         1      1   4   1   80   22.60    0     267  0.0056
26  m80M1 myrthe 20131218         1      1   4   3   80   21.90    0     268  0.0057
    ..... ...... ........         .      .   .   .   ..  ......    .     ...  ...
    ..... ...... ........         .      .   .   .   ..  ......    .     ...  ...
201 h80M1 myrthe 20131219         1      3   5   3   80   33.00    0     194  0.0045
202 h80M1 myrthe 20131219         1      3   6   1   80   52.50    1     195  0.0053
203 h80M1 myrthe 20131219         1      3   4   4   80  314.20    1     196  0.0145

However, as I have approximately 100 types of tmse$expId, I would really like to create this tms$norm variable using a loop function or some kind of apply function.

I tried using this code which does not work but hopefully indicates what I'm trying to achieve:

    uniq <- unique(unlist(tms$expId))
   > for(i in 1:length(uniq)){
       attach(tms[tms$expId==uniq[i], ])
       tms$normReal2 <- tms[tms$expId==uniq[i], ]$realAmpl/mean(tms[(tms$expId==uniq[i]) |       (tms$isi==1),]$realAmpl)
       detach(tms[tms$expId==uniq[i], ])
     }

So my question is: how do I achieve to create this tms$norm variable using a very compact code?

Thank yo very much in advance!

RmyjuloR
  • 369
  • 1
  • 4
  • 13
  • Take a look at the `aggregate` function. – RoyalTS Jan 24 '14 at 23:21
  • @user3233153 It's not clear if you are interested in a subset of the data where the condition isi==1 is true or the complete dataset with NAs – marbel Jan 24 '14 at 23:51
  • I'm interested in the mean value of ampl if isi==1, per expId. So in the data frame selection above, this will be 3 values. – RmyjuloR Jan 25 '14 at 07:36

3 Answers3

2

Try dplyr.

install.packages('dplyr')
require(dplyr)

tms <- group_by(tms, expId)
tms <- mutate(tms, norm = ampl / mean(ampl[isi == 1]))
Vincent
  • 5,063
  • 3
  • 28
  • 39
  • Wow, that is a very quick answer. – RmyjuloR Jan 24 '14 at 23:56
  • However, the package won't install because I have the combo of OS X and R version 3.0.2. Is there any chance this can be done without a package? Or, how can I install the package with the combo of OS X and R version 3.0.2? – RmyjuloR Jan 24 '14 at 23:57
  • Sorry, also doesn't work. When also installing the package "assert that" it returns : Installing github repo(s) dplyr/master from hadley Downloading dplyr.zip from https://github.com/hadley/dplyr/archive/master.zip Installing package from ... Error: package 'Rcpp' 0.10.4 was found, but >= 0.10.6 is required by 'dplyr-master' * removing '/Library/Frameworks/R.framework/Versions/3.0/Resources/library/dplyr' Error: Command failed (1)" – RmyjuloR Jan 25 '14 at 00:15
  • I'm using windows and couldn't install it from github, who knows why. I have devtools, Rtools, etc installed. However, it worked with `install.packages('dplyr')`. Strange.. – marbel Jan 25 '14 at 00:19
  • Got it finally installed and this works very smooth now! Thank you very much for your help! – RmyjuloR Jan 25 '14 at 08:27
1
require(data.table)

set.seed(123)
tms <- data.table(row_id = 1:10e5,
                  expId = letters,
                  isi = c(1,2,3),
                  ampl = rnorm(10e5, 300, 100)
)

EDIT

# Keep the original table
tms[isi==1, isi1_mean:=mean(ampl), by=expId]  # The mean of the ampl column when isi==1 by expId
tms[isi==1, norm:=(ampl/isi1_mean), by=expId]


# Keep a subset where isi==1
tms <- data.table(row_id = 1:10e5,
                  expId = letters,
                  isi = c(1,2,3),
                  ampl = rnorm(10e5, 300, 100)
                  )

tms[isi==1, isi1_mean:=mean(ampl), by=expId]  # The mean of the ampl column when isi==1 by expId
tms[, norm:=(ampl/isi1_mean), by=expId][isi==1]

Benchmark

I'm trying a dplyr version as data.table and one as data.frame. I believe this is the rigth way but please edit the benchmark if there is something incorrect.

require(dplyr)
require(data.table)

set.seed(123)
tms_dt <- data.table(row_id = 1:10e6,
                     expId = letters,
                     isi = c(1,2,3),
                     ampl = rnorm(10e6, 300, 100)
                     )

tms_df <- as.data.frame(tms_dt)

dt_dplyr <- function(data) {
  require(dplyr)
  data <- group_by(data, expId)
  data <- mutate(data, norm = ampl / mean(ampl[isi == 1]))
}

df_dplyr <- function(data) {
  require(dplyr)
  data <- group_by(data, expId)
  data <- mutate(data, norm = ampl / mean(ampl[isi == 1]))
}

dt_datatable <- function(data) {
  data[isi==1, isi1_mean:=mean(ampl), by=expId]
  data[isi==1, norm:=(ampl/isi1_mean), by=expId]
}


require(rbenchmark)
benchmark(dt_dplyr(tms_dt), df_dplyr(tms_df), dt_datatable(tms_dt))
                  test replications elapsed relative user.self sys.self 
2     df_dplyr(tms_df)          100  135.08    1.447    108.81    20.94
3 dt_datatable(tms_dt)          100   93.36    1.000     76.15    16.63
1     dt_dplyr(tms_dt)          100  275.28    2.949    105.34    72.63
marbel
  • 7,560
  • 6
  • 49
  • 68
  • but in this case mean(ampl))][isi==1] is not calculated per expId, but rather over the whole data set? per expId the mean(tms[tms$isi==1,]) is different... – RmyjuloR Jan 25 '14 at 00:01
  • @user3233153 is it rigth now? I'm adding the condition that tms$isi==1, would you like this or rather have the original table and NAs where isi is not 1? – marbel Jan 25 '14 at 00:08
  • again an error: > result <- tms[isi==1, norm:=(ampl/mean(ampl)), by=expId][isi==1] Error in `[.data.frame`(tms, isi == 1, `:=`(norm, (ampl/mean(ampl))), : unused argument (by = expId) – RmyjuloR Jan 25 '14 at 00:18
  • It works in my machine, in a new R session. Try downloading the dev version from R-forge. http://stackoverflow.com/questions/18772277/installing-new-version-of-data-table-specifically-1-8-11-from-rforge – marbel Jan 25 '14 at 00:21
  • for every row within a expId the new variable tms$norm should be calculated. In other words every tms$ampl value should be divided by the value of "mean(tms[tms$isi==1,]$ampl". this mean value is different within every expId – RmyjuloR Jan 25 '14 at 00:21
  • I'm entering the following in the console: install.packages("dplyr", + repos= "http://R-Forge.R-project.org", + type="source") this also returns the error: "Warning message: package ‘dplyr’ is not available (for R version 3.0.2)" – RmyjuloR Jan 25 '14 at 00:24
  • I thought you were meaning to install the new version of `data.table`! – marbel Jan 25 '14 at 00:26
  • Thanks for the update. Unfortunatly I now get the error: "Error in `[.data.frame`(tms, , `:=`(norm, (ampl/isi1_mean)), by = expId) : unused argument (by = expId)" – RmyjuloR Jan 25 '14 at 00:30
  • are you using your data? or what i've provided in the answer? If it's your dataset, first coerce it to a data.table. `tms <- data.table(tms)` – marbel Jan 25 '14 at 00:31
  • Oh! so sorry. very ignorant of me – RmyjuloR Jan 25 '14 at 00:33
  • Yes! it adds a new variable norm to my data. thank you so much. The values however look strange… So you're sure every mean(tms[tms$isi==1,]$ampl) is unique for every expId? – RmyjuloR Jan 25 '14 at 00:37
  • @user3233153 Sorry, i believe the edit should be what you want. The `:=` inserts new columns, so you don't need to copy the table. – marbel Jan 25 '14 at 00:42
  • If you feel some of the answers were useful, consider accepting one as this is how the site works. – marbel Jan 25 '14 at 00:49
  • the problem could be that "isi1_mean = tmsNew[isi==1,mean(ampl)]" calculates the mean of the tms[tms$isi==1]$ampl over all tms$expIds. This should not be calculated over the whole dataset, but per subset of tms$expId. Sorry thatI'm being so demanding :( – RmyjuloR Jan 25 '14 at 00:53
  • 1
    I'll take some time now to better understand your answers. Dont't understand me wrong. Your answers were really helpful! – RmyjuloR Jan 25 '14 at 00:55
  • Sure. Have you tried: `tms[isi==1, isi1_mean:=mean(ampl), by=expId]` the part under **EDIT**. I believe this is what you need. – marbel Jan 25 '14 at 01:01
0

Can't recommend strongly enough taking the time to read this: http://www.jstatsoft.org/v40/i01/paperSplit Apply Combine

In this case, I think you want

install.packages(plyr)
require(plyr)
ddply(tms, .(expId), function(x) {data.frame(x, norm=x$ampl/mean(x[x$isi==1,]["ampl"]))})

Excuse me if this answer comes out with a typo - not in a position to test it myself.

SauceCode
  • 287
  • 1
  • 3
  • 9
  • Thanks for the quick answer. The link sound extremely useful, will read it asap. I tried the code however and per "expId" it results in a warning: "In mean.default(x[x$isi == 1, ]["ampl"]) : argument is not numeric or logical: returning NA" – RmyjuloR Jan 25 '14 at 00:08