How to predict the next column of a dataframe in R

Question

I have a dataframe with many columns, and for each row I want to predict what the value will be in the next column based on the values in the previous n columns.

I can do this one row at a time using the lm() function, and I could use sapply() to loop through each row, but I'm sure there is a neater approach.

I have also come across the question "split on factor, sapply, and lm". The suggestions there would also work, but they would require me to reshape my data with gather() first, and again I'm sure there must be a neater way to do this.

Below is a simplified, reproducible example of my data:

set.seed(0)
my_df <- data.frame(y1 = rnorm(10), y2 = rnorm(10), y3 = rnorm(10), y4 = rnorm(10), y5 = rnorm(10))

I could use the following code which gives me the result I'm after:

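library(dplyr)
# Fit y ~ column position for each row and keep the slope (second coefficient)
my_df_m <- sapply(1:10, FUN = function(x) coef(lm(as.numeric(my_df[x, 1:5]) ~ c(1:5)))[2])
# Extrapolate one step: last observed value plus that row's slope
my_df %>% mutate(y6 = y5 + my_df_m)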

I could also use the following code, which requires some reformatting of my data first (apologies for the mix of data tables and frames, I confess I do not have much experience of the data.table package):

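library(data.table)
library(tidyr)
# Add a row identifier so each original row can be treated as a group
my_df <- cbind(dataSet = c(1:10), my_df)
# Reshape to long format, then replace the column names with positions 1..5
my_df_gather <- gather(my_df, key = "x", value = "y", -dataSet) %>%
  mutate(x = rep(1:5, each = 10))
my_DT <- data.table(my_df_gather)
# Fit one regression per group and add its slope to the last observed column
my_df %>% mutate(y6 = y5 + my_DT[,list(m = lm(y~x)$coefficients[2]), by = dataSet]$m)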

Neither approach is ideal. The sapply() method is fine on a small data set, but I have thousands of rows and will need to repeat this step roughly a hundred times, so I would like to avoid looping and keep it as fast as possible. The second approach is probably closer to what I'm after, but it requires significant reshaping of my data beforehand, which I would also like to avoid. So my question is: is there a function or package that I can use on my data as it is to predict the next column, without having to loop through each row?

score 0 · Accepted Answer · answered Aug 20 '19 at 10:47

There is no need for loops at all: lm() accepts a matrix as the response and fits a separate regression to each of its columns in a single call.

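# Predictor: the column positions 1..ncol(my_df)
regr <- seq_len(ncol(my_df))
# t(my_df) has one column per original row, so each row gets its own fit
fit <- lm(t(my_df) ~ regr)

# Predict at the next column position
newdf <- data.frame(regr = ncol(my_df) + 1)
pred <- predict(fit, newdata = newdf)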

pred
#       [,1]      [,2]       [,3]      [,4]       [,5]      [,6]
#1 0.6620746 0.4630261 -0.9693691 -1.309558 -0.7172375 0.4813361
#        [,7]      [,8]      [,9]     [,10]
#1 -0.1915871 0.7793626 0.9866084 -1.150298
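
Note that predict() returns the fitted value at the next column position (intercept + slope * 6), whereas the code in the question computes y6 as y5 plus the slope. The two generally differ. The following is a minimal sketch, assuming my_df holds only the five original columns (drop dataSet first if it has been added), of how both versions could be obtained from the same single fit. The names slopes, y6_slope, and y6_fitted are illustrative and not part of the original answer.

```r
set.seed(0)
my_df <- data.frame(y1 = rnorm(10), y2 = rnorm(10), y3 = rnorm(10),
                    y4 = rnorm(10), y5 = rnorm(10))

# Column positions serve as the predictor
regr <- seq_len(ncol(my_df))

# Matrix response: each row of my_df becomes one column of the response,
# so lm() fits one regression per original row
fit <- lm(t(my_df) ~ regr)

# For a matrix response, coef() is a matrix with one column per fit;
# the "regr" row holds the slopes
slopes <- coef(fit)["regr", ]

# Variant used in the question: last observed value plus the slope
y6_slope <- my_df$y5 + slopes

# Variant used in the answer: fitted value at position ncol(my_df) + 1
y6_fitted <- as.numeric(predict(fit, newdata = data.frame(regr = ncol(my_df) + 1)))

my_df$y6 <- y6_slope  # or y6_fitted, depending on which definition is wanted
```

Which variant is appropriate depends on whether the next value should lie on the fitted line or continue from the last observation.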
