
I am trying to run a series of simple linear regressions using a data frame with 4 columns (all of which are dependent variables) and a data frame of predictors with 194 rows and 212 columns. I have 5 other data frames to use as dependent variables for the same analysis.

I have achieved the desired results for one dependent variable, but I need to scale this out. I have tried to add in an extra for loop (for the columns of the dependent variable), but I would also need to simultaneously create more empty lists.

How can I achieve this?

My current for-loop is:

y <- data.frame(Green_Class_Commercial[,-1])
x <- data.frame(lagvar[1:175,c(-1,-2)])
out <- data.frame(NULL)              # create object to keep results

for (i in seq_along(x)) {
  m <- summary(lm(y[,1] ~ x[,i]))    # run model
  out[i, 1] <- names(x)[i]           # store variable name
  out[i, 2] <- m$coefficients[1,1]   # intercept
  out[i, 3] <- m$coefficients[2,1]   # coefficient
  out[i, 4] <- m$coefficients[2,4]   # p-value
  out[i, 5] <- m$r.squared           # R-squared
}
names(out) <- c("Variable", "Intercept", "Coefficient","P-val","R-square")
head(out)

Giving the output

> head(out)
               Variable Intercept   Coefficient     P-val     R-square
1                GDP.SC 0.2540527 -4.722220e-07 0.7032087 8.411229e-04
2               GDP.SC1 0.1148311  3.107631e-07 0.7959237 3.899366e-04
3               GDP.SC2 0.1609010  4.998762e-08 0.9673014 9.855831e-06
4               GDP.SC3 0.1353608  1.959274e-07 0.8746321 1.468544e-04
5               GDP.SC4 0.1439931  1.487237e-07 0.9064221 8.200597e-05
6 CivilianLaborForce.SC 0.2595231 -4.078450e-08 0.7716514 4.881398e-04
> 

Here are the variables I want to use in the regression:

#The x Variable
structure(list(GDP.SC = c(154698, 154698, 154698, 154698, 154698, 
154698, 154698, 154698, 154698, 154698, 160138.4, 160138.4, 160138.4, 
160138.4, 160138.4, 160138.4, 160138.4, 160138.4, 160138.4, 160138.4
), GDP.SC1 = c(NA, 154698, 154698, 154698, 154698, 154698, 154698, 
154698, 154698, 154698, 154698, 160138.4, 160138.4, 160138.4, 
160138.4, 160138.4, 160138.4, 160138.4, 160138.4, 160138.4), 
    GDP.SC2 = c(NA, NA, 154698, 154698, 154698, 154698, 154698, 
    154698, 154698, 154698, 154698, 154698, 160138.4, 160138.4, 
    160138.4, 160138.4, 160138.4, 160138.4, 160138.4, 160138.4
    ), GDP.SC3 = c(NA, NA, NA, 154698, 154698, 154698, 154698, 
    154698, 154698, 154698, 154698, 154698, 154698, 160138.4, 
    160138.4, 160138.4, 160138.4, 160138.4, 160138.4, 160138.4
    ), GDP.SC4 = c(NA, NA, NA, NA, 154698, 154698, 154698, 154698, 
    154698, 154698, 154698, 154698, 154698, 154698, 160138.4, 
    160138.4, 160138.4, 160138.4, 160138.4, 160138.4)), row.names = c(NA, 
20L), class = "data.frame")

#The Y Variable
structure(list(X = 1:20, ComBus = c(0.83, 0, 0.23, 0.09, 0.1, 
0.11, 0.15, 0.18, 0.37, 0.19, 0, 0.18, 0.09, 0.1, 0.03, 0.5, 
0.14, 0.17, 0.11, 0.06), ComCon = c(NA, 0, 0, 0, 0, 0.5, 0, 0, 
NA, 0.67, 0, 0, 0, 0, 0.5, 0, 0, NA, 1, 0), ComNoo = c(0.25, 
0.14, 0.38, 0.17, 0.14, 0.33, 0.44, 0.05, 0.04, 0.1, 0.18, 0.06, 
0.23, 0.14, 0.5, 0.14, 0.5, 0, 0.14, 0.23), ComOO = c(0, 0, 0, 
0, 0, 0.33, 0, 0, 0, 0.18, 0.22, 0.15, 0, 0, 0.17, 0, 0, 0, 0, 
0)), row.names = c(NA, 20L), class = "data.frame")
Jonathan Hall
Jerald Achaibar
  • Do you mean applying it to other data frames? The sentence "I need to scale this out, I have tried to add in an extra for loop" is a bit vague, and your example doesn't show what else you would like to do – StupidWolf Apr 06 '20 at 23:43
  • Why not keep all regression data in a single data frame? – Parfait Apr 07 '20 at 00:02
  • Hey, there are several possible solutions to this one. If you require doing the same thing lots of times I would usually write it as a function and feed it the dataframes (or `map` it to the dataframes). Something like this that outputs each `lm` as a list element: `lm_func <- function(d1, df2) { out <- apply(df, MARGIN = 2, function(x) { lm(df2[, 1] ~ x) }) return(out) }` `broom::glance` might be useful here too. I can try give a more full answer if this is on the right lines for you. – QAsena Apr 07 '20 at 00:13
  • @StupidWolf Sorry about that. So ```y=Green_Class_Commercial``` is a data frame with 175 rows and 4 columns. I also have 5 similar data frames that I need to substitute in for y (all of which have four columns) as the response variable in the regression model. X is a data frame with 194 rows and 212 columns (and I need to run a simple linear regression one at a time with each of the 212 variables). In my current code I have done that for one column of the Green_Class_Commercial data frame, and I wish to do this again with columns 2-4. Let me know if this helps, thanks – Jerald Achaibar Apr 07 '20 at 00:54
  • @Parfait thanks for the reply. That would make sorting through the dataset much harder. The Y variables represent the probability that someone defaults on a loan in Greenville, South Carolina. This was a dataset given to me by my professor for my current class; many of those probabilities were filtered based on status (classified or passed) and location (Greenville or other), and also across 6 different sectors (consumer business, commercial business, etc.) – Jerald Achaibar Apr 07 '20 at 00:57
  • @QAsena thanks for the help. Yes, would you mind explaining that a bit more? When you say ```apply(df, ``` which data frame does this refer to? Would I use this in the for loop? Thank you again – Jerald Achaibar Apr 07 '20 at 01:07
  • @Jerald Achaibar. `apply` is replacing the loop in this case. I find it easier to work with in many cases. I've posted an answer which is more carefully explained using `x_df` and `y_df`. Possibly using `glance` is not the right option, let me know. – QAsena Apr 07 '20 at 02:28
  • In `lm` you can refer to named columns of data frame, using *data* argument. Please `dput` your data or mock up sample data for [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). I will show how easy it is to use a single data frame with dynamic formula for `lm`. – Parfait Apr 07 '20 at 03:53
  • @Parfait @QAsena Thank you all so much for the help, I have figured out how to do the `dput` function and have included that in my question – Jerald Achaibar Apr 08 '20 at 03:56

2 Answers


Ok, is this good for you? I'm replacing the loop with apply if that is ok?

### Some dummy dataframes
x <- data.frame(v1 = rnorm(10),
                v2 = rnorm(10),
                v3 = runif(10, 1, 1000),
                v4 = runif(10, 1, 1000))
x2 <- data.frame(v1 = rnorm(10),
                 v2 = rnorm(10),
                 v3 = runif(10, 1, 1000),
                 v4 = runif(10, 1, 1000))
y <- data.frame(v1 = rnorm(10),
                v2 = rnorm(10),
                v3 = runif(10, 1, 1000),
                v4 = runif(10, 1, 1000))
y2 <- data.frame(v1 = rnorm(10),
                 v2 = rnorm(10),
                 v3 = runif(10, 1, 1000),
                 v4 = runif(10, 1, 1000))

###
# I tend to prefer the apply family of functions to replace loops where possible.
# This function takes two inputs, dataframes of dependent and independent variables.
# the apply function here takes the x_df and applies the following anonymous function to each column
# so for each column in x_df it performs a lm against the first column of y_df

lm_func <- function(y_df, x_df) {
  out <- apply(x_df, MARGIN = 2, function(x) {
    lm(y_df[, 1] ~ x)
  })
  return(out)
}

results_list <- lm_func(y, x)

# the output is one list element per lm. I like to keep the whole lm output just in case you need to go back to it

# we can then turn that list back into a dataframe using rbindlist from data.table
# and get what I think is your desired output using glance from broom

library(data.table)
library(broom)

results_glance <- rbindlist(lapply(results_list, glance), idcol = "var_name")

# or keep it as a list if you wish
results_list_glance <- lapply(results_list, glance)

# to run the function using a single x argument but multiple y arguments you can use mapply

results_list_m <- mapply(lm_func,
                         y_df = list(y, y2),
                         MoreArgs = list(    # other arguments you want to keep fixed
                           x_df = x
                         ),
                         SIMPLIFY = FALSE
)

# the output is a little fiendish because it will be a list of lists
# we can include the rbindlist and glance into the function to make the output a little simpler:


lm_func_bind <- function(y_df, x_df) {
  out <- apply(x_df, MARGIN = 2, function(x) {
    lm(y_df[, 1] ~ x)
  })
  out <- rbindlist(lapply(out, glance), idcol = "var_name")
  return(out)
}
results_glance_df <- lm_func_bind(y, x)

results_list_dfs <- mapply(lm_func_bind,
                           y_df = list(y, y2),
                           MoreArgs = list(    # other arguments you want to keep fixed
                             x_df = x
                           ),
                           SIMPLIFY = FALSE
)

Let me know if I can make this better. If you are not familiar with some of the functions, like apply and rbindlist, their documentation is worth checking out. Cheers!

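P.S. Running many separate linear models is usually not ideal: the more tests you run, the greater the chance that some come out "significant" purely by chance (the multiple comparisons problem). That is more a question of stats than of coding, though!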
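
To make the caveat above concrete: if none of the 212 predictors had any real relationship with the response, about ten of them would still be expected to fall below p = 0.05. One common response is to adjust the p-values. The following is a minimal sketch, not part of the original answer, using `p.adjust` from base R; the p-values are the ones shown in the `head(out)` table in the question and are used purely for illustration.

```r
# P-values copied from the head(out) table in the question (illustrative only)
p_raw <- c(GDP.SC = 0.7032087, GDP.SC1 = 0.7959237, GDP.SC2 = 0.9673014,
           GDP.SC3 = 0.8746321, GDP.SC4 = 0.9064221,
           CivilianLaborForce.SC = 0.7716514)

# Benjamini-Hochberg controls the false discovery rate;
# "bonferroni" or "holm" are stricter alternatives
p_bh <- p.adjust(p_raw, method = "BH")

# Raw and adjusted p-values side by side
data.frame(Variable = names(p_raw), P_val = p_raw, P_adj = p_bh,
           row.names = NULL)
```

In practice the adjustment would be applied to the full vector of p-values from all models, not just the first six. Which method is appropriate depends on the analysis, so treat the choice of "BH" here as an assumption.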

QAsena
  • This was very informative, thank you for taking the time to walk me through it step by step! However, the only problem is that I now get different p-values; any idea why this would be happening? I used `head(results_glance[1:5,c(1,6)])` and the results were very different compared to above. – Jerald Achaibar Apr 07 '20 at 03:45
  •    var_name   p.value
    1:   GDP.SC 0.9894016
    2:  GDP.SC1 0.9957419
    3:  GDP.SC2 0.9998764
    4:  GDP.SC3 0.9978171
    5:  GDP.SC4 0.9951122
    – Jerald Achaibar Apr 07 '20 at 03:49
  • mmm I'm not sure. If I run your loop on the same dummy dataframes as my example I get the same result for both. Tricky to check over without a reproducible example, can you provide one? `dput(YOURDATA)`, or `head(dput(YOURDATA), n = 20)` can help if the frames are not crazy large. Any leftover objects or something in your environment? Maybe test on a fresh R session if you haven't already? Or maybe your real dataframes are a little more complicated than the simple dummy ones? – QAsena Apr 07 '20 at 04:12
  • I might have spotted the problem. In your comment above you mention substituting 5 dataframes for `y`. In the code I gave you, the `y` dataframe stays the same and the `x` dataframe changes... Easy to fix if this is my mistake, just let me know and I'll update the answer. – QAsena Apr 07 '20 at 09:23
  • yes i see this now, the Y data frame is the one I am changing, Thank you – Jerald Achaibar Apr 07 '20 at 15:58
  • Ok, I swapped the x and y in the `mapply` call. Now y changes and x stays the same. This work for you? It's a little tricky for me to read the data from your comments directly into R as they are not formatted. the `dput()` function provides a structure that can easily be copy-pasted in and you can give it just a slice of your data for example: `dput(data[1:10,1:3])`. You can paste the output from `dput()` into your question so that anyone can experiment with it. – QAsena Apr 07 '20 at 20:24
  • I have added the dput in the question, thank you @QAsena! The change you made to the mapply function seems like it would work; however, when I try to rbind I get the following: `results_glance2 <- rbindlist(lapply(results_list_m, glance), idcol = "var_name")` `Error: No glance method recognized for this list.` – Jerald Achaibar Apr 08 '20 at 04:05
  • yep, so `results_list_m` runs the function `lm_func` and gives a list of lists, so you would need `rbindlist(lapply(unlist(results_list_m, recursive = F), glance), idcol = "var_name")`. Or you could use the second function `lm_func_bind`, which uses `rbindlist` along the way and gives a list of `tibbles`/`dataframes` already bound. The `mapply` stuff takes a bit of getting used to. – QAsena Apr 08 '20 at 04:27
  • hello again @QAsena, needing some additional assistance here [link](https://stackoverflow.com/questions/61235586/how-to-run-run-multiple-regression-and-prediction-for-multiple-columns-of-data) – Jerald Achaibar Apr 15 '20 at 18:21
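
The "list of lists" shape discussed in the comments above can be seen on a small example. This is a minimal base-R sketch with made-up data; the names `x_demo`, `y_a`, `y_b`, `fits` and `flat` are hypothetical and not from the thread. It shows what `unlist(..., recursive = FALSE)` does to the nesting.

```r
set.seed(42)
x_demo <- data.frame(v1 = rnorm(10), v2 = rnorm(10))  # made-up predictors
y_a <- rnorm(10)                                      # made-up responses
y_b <- rnorm(10)

# Outer list: one element per response; inner list: one lm per predictor
fits <- lapply(list(a = y_a, b = y_b), function(yv) {
  lapply(x_demo, function(xv) lm(yv ~ xv))
})

# Flatten one level only: names become "<response>.<predictor>"
flat <- unlist(fits, recursive = FALSE)
names(flat)

# Each element is now a single lm, so per-model functions work again
sapply(flat, function(m) summary(m)$r.squared)
```

Because each element of `flat` is a single fitted model rather than a list of models, functions that expect one model at a time can be applied to it directly.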

Consider a nested lapply: the outer call traverses each column of the dependent-variable data frame, and on each pass the inner call iterates through all columns of the independent-variable data frame:

reg_data <- function(yvar, xdf) {
    # ITERATE THROUGH EACH COLUMN OF xdf
    df_list <- lapply(seq_along(xdf), function(i) {
      m <- summary(lm(yvar ~ xdf[,i]))      # run model

      data.frame(
         Variable = names(xdf)[i],          # variable name
         Intercept = m$coefficients[1,1],   # intercept
         Coefficient = m$coefficients[2,1], # coefficient
         P_val = m$coefficients[2,4],       # P-value
         R_square = m$r.squared             # R-squared
      )
    })

    return(do.call(rbind, df_list))
}

# ITERATE THROUGH EACH COLUMN OF y
model_dfs <- lapply(y[-1], function(col) reg_data(col, x))

Output

model_dfs

# $ComBus
#   Variable  Intercept   Coefficient     P_val    R_square
# 1   GDP.SC  2.6988486 -1.599147e-05 0.3262406 0.053555129
# 2  GDP.SC1 -0.1802638  2.083180e-06 0.8452577 0.002304901
# 3  GDP.SC2  0.4443504 -1.838100e-06 0.8656578 0.001843828
# 4  GDP.SC3 -0.2114691  2.310754e-06 0.8410848 0.002767098
# 5  GDP.SC4 -0.4596142  3.921280e-06 0.7517165 0.007381776

# $ComCon
#   Variable  Intercept  Coefficient     P_val    R_square
# 1   GDP.SC -0.4342988 3.752788e-06 0.8970060 0.001154220
# 2  GDP.SC1 -1.5050149 1.056908e-05 0.7148913 0.009154924
# 3  GDP.SC2 -2.2666678 1.549256e-05 0.6144502 0.018606737
# 4  GDP.SC3 -3.2822050 2.205720e-05 0.5032585 0.035178198
# 5  GDP.SC4 -4.7039571 3.124770e-05 0.3808557 0.064522691

# $ComNoo
#   Variable   Intercept  Coefficient     P_val    R_square
# 1   GDP.SC -0.02348033 1.470480e-06 0.9087818 0.000749555
# 2  GDP.SC1 -0.33062799 3.410697e-06 0.8011075 0.003836926
# 3  GDP.SC2 -1.11901191 8.455261e-06 0.5536610 0.022365205
# 4  GDP.SC3 -1.58084828 1.134370e-05 0.4400999 0.040243587
# 5  GDP.SC4 -2.12276002 1.482734e-05 0.3493362 0.062765524

# $ComOO
#   Variable   Intercept   Coefficient     P_val    R_square
# 1   GDP.SC -0.03430512  5.514300e-07 0.9481025 0.000241968
# 2  GDP.SC1  1.13773433 -6.882664e-06 0.4347716 0.036277535
# 3  GDP.SC2  1.98603902 -1.226932e-05 0.1785540 0.110105644
# 4  GDP.SC3  1.89971836 -1.171132e-05 0.2291494 0.094842038
# 5  GDP.SC4  1.78462415 -1.096733e-05 0.2963624 0.077531366
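
If a single table is easier to work with than a named list, the per-response data frames can be stacked with an extra column recording which response each row came from. The following is a minimal base-R sketch, not part of the original answer. `model_dfs_demo` is a hypothetical stand-in shaped like `model_dfs`, with a few values copied from the output above so that the block runs on its own.

```r
# Stand-in for model_dfs: a named list of data frames, one per response column
# (values copied from the first rows of the output above)
model_dfs_demo <- list(
  ComBus = data.frame(Variable = c("GDP.SC", "GDP.SC1"),
                      P_val = c(0.3262406, 0.8452577)),
  ComCon = data.frame(Variable = c("GDP.SC", "GDP.SC1"),
                      P_val = c(0.8970060, 0.7148913))
)

# Add the list name as a Response column, then stack everything
stacked <- do.call(rbind, Map(function(df, nm) cbind(Response = nm, df),
                              model_dfs_demo, names(model_dfs_demo)))
rownames(stacked) <- NULL
stacked
```

The same idea might extend one level up. Assuming the six response data frames were kept in a named list (here called `y_list`, a hypothetical name) and each had an index column first, something like `lapply(y_list, function(ydf) lapply(ydf[-1], reg_data, xdf = x))` would give one list of result tables per response data frame.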


Parfait
  • !!!! This is exactly what I needed, thank you so much for the help @Parfait! Now I noticed you have a negative in front of the coefficient for the p-value; is there a particular reason for this? Also, I am seeing a wide range of p-values, and an error saying the number of rows differ when I try to do this same regression without the lapply. Could this cause the discrepancy in the p-values I am getting? – Jerald Achaibar Apr 08 '20 at 21:49
  • @Parfait Interesting use of `lapply` there. Not used it with `seq_along` before. Up-vote for that. – QAsena Apr 08 '20 at 22:13
  • Negative p-values were a typing error. I removed the hyphen before the extracted value from the *coefficients* table. Vectors in `lm` must be the same length to run properly, so no, you cannot run a regression across data frames of different lengths. – Parfait Apr 08 '20 at 22:18
  • hello again @Parfait, needing some additional assistance here [link](https://stackoverflow.com/questions/61235586/how-to-run-run-multiple-regression-and-prediction-for-multiple-columns-of-data) – Jerald Achaibar Apr 15 '20 at 18:21
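
On the row-count point raised in the comments above: `lm` needs the response and the predictor to have the same number of observations, which is presumably why the question subsets `lagvar` to rows 1:175 before modelling. Rows containing `NA` are a separate matter. With the default `na.action` they are dropped per model, so lagged predictors are fitted on fewer rows than unlagged ones. A minimal sketch with made-up data follows; the names `y_demo`, `x_demo` and `x_trim` are hypothetical, and whether the first rows are the right ones to keep is an assumption that depends on how the real data are aligned.

```r
set.seed(1)
y_demo <- data.frame(resp = runif(10))        # 10 rows
x_demo <- data.frame(v  = rnorm(12),          # 12 rows
                     v1 = c(NA, rnorm(11)))   # lagged copy starts with NA

# Trim the predictors to the rows that match the response
x_trim <- x_demo[seq_len(nrow(y_demo)), , drop = FALSE]

fit_v  <- lm(y_demo$resp ~ x_trim$v)
fit_v1 <- lm(y_demo$resp ~ x_trim$v1)

# Number of observations actually used by each fit
c(v = nobs(fit_v), v1 = nobs(fit_v1))
```

Here the first model uses all 10 rows and the second uses 9, because the row with the missing lagged value is dropped. Comparing p-values or R-squared across such models means comparing fits on slightly different samples.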