R: how to perform more complex calculations from a combn of a dataset?

Question

Right now, I have a combn from the built in dataset iris. So far, I have been guided into being able to find the coefficient of lm() of the pair of values.

myPairs <- combn(names(iris[1:4]), 2)

formula <- apply(myPairs, MARGIN=2, FUN=paste, collapse="~")

model <- lapply(formula, function(x) lm(formula=x, data=iris)$coefficients[2])

model

However, I would like to go a few steps further and use the coefficient from lm() to be used in further calculations. I would like to do something like this:

Coefficient <- lm(formula=x, data=iris)$coefficients[2]
Spread <- myPairs[1] - coefficient*myPairs[2]
library(tseries)
adf.test(Spread)

The procedure itself is simple enough, but I haven't been able to find a way to do this for each combn in the data set. (As a sidenote, the adf.test would not be applied to such data, but I'm just using the iris dataset for demonstration). I'm wondering, would it be better to write a loop for such a procedure?

Just wanted to be clear about what you seek. You want a solution which will provide a result (specifically the last 4 lines) for each combination without having to employ a loop. Is that right? — Analytical Monk, Jun 15 '16 at 17:21
I'm a little confused by your second block. Do you want to calculate spread for all pairs? What happens at the end (`myPairs[6] - coefficient*`???)? — TARehman, Jun 15 '16 at 17:24
@AnalyticalMonk Yes that's right, though if a loop is more efficient, I would not mind writing one. — Luke Zhang, Jun 15 '16 at 17:26
@TARehman Basically, I'm interested in the slope, or coefficient in the 2nd vector of myPairs[i]. With this coefficient, I'd like to calculate the spread, which uses this coefficient by subtracting the 2nd vector from the 1st, followed by the adf.test — Luke Zhang, Jun 15 '16 at 17:29
@DamienScott `myPairs` is the matrix of pairs - I'm confused. When you say the "2nd vector" and the "1st vector" what do you mean? Can you write an example that works on a single combn into your question? — TARehman, Jun 15 '16 at 17:34
`myPairs` is a character matrix. you can't do arithmetic with it. — Ernest A, Jun 15 '16 at 17:35
@TARehman I think Ernest A may have answered it for me. I meant that each combn is a pair of 2. By "first vector", I was thinking the first name in the pair. Eg. Sepal.Width ~ Petal.Length, Sepal.Width being the "first vector". I'll reexamine to see where I'm really going wrong. — Luke Zhang, Jun 15 '16 at 17:45
@DamienScott It sounds like maybe you are trying to access the columns in `iris` using the names in `myPairs`? ETA: I think I figured out what you were striking for - let me know. — TARehman, Jun 15 '16 at 17:48
@TARehman yes that's right,I was using the names to refer to the numerical values in its corresponding columns — Luke Zhang, Jun 15 '16 at 18:06

jkt · Answer 1 · 2016-06-16T09:28:12.350

2

Sounds like you would want to write your own function and call it in your myPairs loop (apply):

yourfun <- function(pair){
  fm <- paste(pair, collapse='~')
  coef <- lm(formula=fm, data=iris)$coefficients[2]
  Spread <- iris[,pair[1]] - coef*iris[,pair[2]] 
  return(Spread)
}

Then you can call this function:

model <- apply(myPairs, 2, yourfun)

I think this is the cleanest way. But I don't know what exactly you want to do, so I was making up the example for Spread. Note that in my example you get warning messages, since column Species is a factor.

edited Jun 16 '16 at 09:28

answered Jun 15 '16 at 17:33

jkt

946
1
7
18

I think you are making this needlessly complicated. First, you can use a `lapply()` to avoid an `apply()` call, but more generally, the `eval(parse())` bit is probably replaceable with named vectors. – TARehman Jun 15 '16 at 17:44
Yes, thanks, I am aware of the eval(parse()) complication, which was just a quick work around to avoid some strange issue (now resolved). I'll edit the answer; it might still be useful. – jkt Jun 16 '16 at 09:26

score 2 · Accepted Answer · answered Jun 15 '16 at 18:44

You can do all of this within combn.

If you just wanted to run the regression over all combinations, and extract the second coefficient you could do

fun <- function(x) coef(lm(paste(x, collapse="~"), data=iris))[2]
combn(names(iris[1:4]), 2, fun)

You can then extend the function to calculate the spread

fun <- function(x) {
         est <- coef(lm(paste(x, collapse="~"), data=iris))[2]
         spread <- iris[,x[1]] - est*iris[,x[2]]
         adf.test(spread)
        }

out <- combn(names(iris[1:4]), 2, fun, simplify=FALSE)
out[[1]]

#   Augmented Dickey-Fuller Test

#data:  spread
#Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
#alternative hypothesis: stationary

Compare results to running the first one manually

est <- coef(lm(Sepal.Length ~ Sepal.Width, data=iris))[2]
spread <- iris[,"Sepal.Length"] - est*iris[,"Sepal.Width"]
adf.test(spread)

#   Augmented Dickey-Fuller Test

# data:  spread
# Dickey-Fuller = -3.879, Lag order = 5, p-value = 0.01707
# alternative hypothesis: stationary

Thanks, other answer was good, but this is exactly what I was trying to do. — Luke Zhang, Jun 15 '16 at 19:02

score 1 · Answer 3 · answered Jun 15 '16 at 17:50

A few tips: I wouldn't name things that you with the same name as built-in functions (model, formula come to mind in your original version).

Also, you can simplify the paste you are doing - see the below.

Finally, a more general statement: don't feel like everything needs to be done in a *apply of some kind. Sometimes brevity and short code is actually harder to understand, and remember, the *apply functions offer at best, marginal speed gains over a simple for loop. (This was not always the case with R, but it is at this point).

# Get pairs
myPairs <- combn(x = names(x = iris[1:4]),m = 2)

# Just directly use paste() here
myFormulas <- paste(myPairs[1,],myPairs[2,],sep = "~")

# Store the models themselves into a list
# This lets you go back to the models later if you need something else
myModels <- lapply(X = myFormulas,FUN = lm,data = iris)

# If you use sapply() and this simple function, you get back a named vector
# This seems like it could be useful to what you want to do
myCoeffs <- sapply(X = myModels,FUN = function (x) {return(x$coefficients[2])})

# Now, you can do this using vectorized operations
iris[myPairs[1,]] - iris[myPairs[2,]] * myCoeffs[myPairs[2,]]

If I am understanding right, I believe the above will work. Note that the names on the output at present will be nonsensical, you would need to replace them with something of your own design (maybe the values of myFormulas).

You can do all this within `combn`: `fun <- function(x) coef(lm(paste(x, collapse="~"), data=iris))[2] ; combn(names(iris[1:4]), 2, fun)` — user20650, Jun 15 '16 at 18:32
Interesting, I didn't know that. Still seems to violate that third part about brevity not always being the best. — TARehman, Jun 15 '16 at 18:37

R: how to perform more complex calculations from a combn of a dataset?

3 Answers3

Linked