My question is almost answered in "dplyr 0.3.0.9000 how to use do() correctly", but not quite.

I have some data that looks like this:

> head(myData)
   Sequence Index  xSamples ySamples
6         0     5 0.3316187 3.244171
7         0     6 1.5131778 2.719893
8         0     7 1.9088933 3.122991
9         0     8 2.7940244 3.616815
10        0     9 3.6500311 3.519641

The Sequence actually ranges from 0 to 9999. Within each Sequence both the xSamples and the ySamples should be linear with respect to Index. The plan is to group myData by Sequence and then use lm() via do() on each group. The code goes something like this (lifted shamelessly from the help):

library(dplyr)
myData_by_sequence <- group_by(myData, Sequence)
models <- myData_by_sequence %>% do(mod = lm(xSamples ~ Index, data = .))

This works, but the result I get is this . . .

> head(models)
Source: local data frame [10000 x 2]

  Sequence     mod
1        0 <S3:lm>
2        1 <S3:lm>
3        2 <S3:lm>
4        3 <S3:lm>
5        4 <S3:lm>
6        5 <S3:lm>

. . . and the data I want is stuck in that second column. I have a working plyr solution which goes like this . . .

models <- dlply(myData, "Sequence", function(df) lm(xSamples ~ Index, data = df))
xresult <- ldply(models, coef)

. . . and this gives me the results broken out into a data frame thanks to coef(). The catch is I can't mix dplyr (which I typically use and love) with plyr, and I can't seem to get coef() working with that second column from the dplyr output.

I've tried a few other approaches such as trying the coef() and lm() steps together, and I can break out the second column into a list of linear models, but I can't use do() on a list.

I really feel like there is something obvious I'm missing here. R is definitely not my primary language. Any help would be appreciated.

Edit: I have tried . . .

result <-
    rects %>% 
    group_by(Sequence) %>% 
    do(data.frame(Coef = coef(lm(xSamples ~ Frame, data = .))))

. . . and get something very close, but with the coefficients stacked in the same column:

  Sequence       Coef
1        0 -5.0189823
2        0  1.0004240
3        1 -4.9411745
4        1  0.9981858
timbo
  • Try `myData %>% group_by(Sequence) %>% do(data.frame(Coef = coef(lm(xSamples~Index, data=.))))` – akrun Jul 21 '15 at 14:43
  • Thanks, your reply is good and I can work with the result, though the result is I have the coefficients arranged linearly instead of in columns (so the rows are alternating intercept, index). Appreciate the quick answer! – timbo Jul 21 '15 at 14:52
  • I was working with your head data. It would be better to provide a little more comprehensive example with the expected output. – akrun Jul 21 '15 at 14:53
  • Try `myData %>% group_by(Sequence) %>% do(data.frame(Coef = as.list(coef(lm(xSamples~Index, data=.)))))` – akrun Jul 21 '15 at 14:54
  • In the previous code, it was all stacked in a single column 'Coef'. I think by using `as.list`, it will be two columns, is that your expected result? – akrun Jul 21 '15 at 14:56
  • Perfect! Thanks! I can see I need to figure out this data type stuff in R a bit better. I find it a lot more opaque than Java or C++. – timbo Jul 21 '15 at 14:57
  • It seems to be slower than ldply which surprises me. Are there alternatives? I guess I could just loop, but that doesn't seem to be the R way. – timbo Jul 21 '15 at 15:03
  • Have you tried the `data.table` option? It should be fast. – akrun Jul 21 '15 at 15:03
  • @akrun I'd love to see the `data.table` in action:) – Khashaa Jul 21 '15 at 15:06
  • @Khashaa Updated with a possible option – akrun Jul 21 '15 at 15:06
  • I used to use data.table but everyone else I knew was using dplyr so I moved away. I used to use data.table and loops, probably related to my C/Java upbringing. – timbo Jul 21 '15 at 15:07
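
The difference between the stacked and the side-by-side result discussed in these comments comes down to how `data.frame()` treats a vector versus a list. The following is a minimal sketch in base R; the vector `cf` and its values are made up for illustration and merely stand in for the output of `coef(lm(...))`.

```r
# Hypothetical stand-in for coef(lm(...)): a named numeric vector.
cf <- c("(Intercept)" = -5.02, Index = 1.00)

# A vector becomes ONE column with one row per element (the "stacked" result).
stacked <- data.frame(Coef = cf)
dim(stacked)  # 2 rows, 1 column

# A list becomes one column PER element, so each group yields a single row.
wide <- data.frame(as.list(cf))
dim(wide)     # 1 row, 2 columns

# data.frame() sanitises "(Intercept)" into a syntactic column name.
names(wide)
```

Because the intercept's name gets rewritten by `data.frame()`, it can be tidier to assign column names explicitly, for example with `setNames()`.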

1 Answer

Try

library(dplyr) 
myData %>%
      group_by(Sequence) %>%
      do(data.frame(setNames(as.list(coef(lm(xSamples~Index, data=.))),
                 c('Intercept', 'Index'))))
#    Sequence Intercept     Index
#1        0 -3.502821 0.7917671
#2        1  3.071611 0.3226020

Or using data.table

 library(data.table)
 setDT(myData)[, as.list(coef(lm(xSamples~Index))) , by = Sequence]
 #   Sequence (Intercept)     Index
 #1:        0   -3.502821 0.7917671
 #2:        1    3.071611 0.3226020

data

 myData <- structure(list(Sequence = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L,
 1L, 1L), Index = c(5L, 6L, 7L, 8L, 9L, 15L, 6L, 9L, 6L, 10L),
 xSamples = c(0.3316187, 
 1.5131778, 1.9088933, 2.7940244, 3.6500311, 7.3316187, 4.5131778, 
 9.9088933, 3.7940244, 4.6500311), ySamples = c(3.244171, 2.719893, 
 3.122991, 3.616815, 3.519641, 3.244171, 8.719893, 5.122991, 7.616815, 
 5.519641)), .Names = c("Sequence", "Index", "xSamples", "ySamples"
 ), class = "data.frame", row.names = c(NA, -10L))
akrun
  • Thanks for the speedy response! One of my versions was almost this but without the right type conversions. – timbo Jul 21 '15 at 15:00
  • Just curious: Suppose you wanted to keep the object with the lms. How would you then extract the coefficients? – Felipe Gerard Jul 21 '15 at 15:01
  • @FelipeGerard The `object` is a `list` column (if you are using only the `lm`). You can use `lapply`/`sapply` to extract the list elements – akrun Jul 21 '15 at 15:03
  • I should have tried that. I was trying to keep things in dplyr, I guess just because I thought it should all be nicer. – timbo Jul 21 '15 at 15:04
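
The `lapply`/`sapply` suggestion in the comments above could look roughly like the sketch below. It uses base R only: the list `mods` is built here with `split()` and `lapply()` as an assumed stand-in for the `mod` list column that `do(mod = lm(...))` produces, and the data are the first three columns of the answer's sample data.

```r
# Sample data from the answer (ySamples omitted because it is not used here).
myData <- data.frame(
  Sequence = rep(0:1, each = 5),
  Index    = c(5, 6, 7, 8, 9, 15, 6, 9, 6, 10),
  xSamples = c(0.3316187, 1.5131778, 1.9088933, 2.7940244, 3.6500311,
               7.3316187, 4.5131778, 9.9088933, 3.7940244, 4.6500311)
)

# Assumed stand-in for models$mod: a plain list of lm objects, one per Sequence.
mods <- lapply(split(myData, myData$Sequence),
               function(df) lm(xSamples ~ Index, data = df))

# sapply() returns a matrix with one column per model;
# transposing gives one row per Sequence and one column per coefficient.
coefs <- t(sapply(mods, coef))
coefs
```

With the dplyr result from the question, the same idea would presumably be `t(sapply(models$mod, coef))`, since `models$mod` is likewise a list of `lm` objects.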