
I have a data frame with 159 observations and 27 variables, and I want to correlate column 4 (variable 4) with each of the following columns, that is, correlate column 4 with 5, then column 4 with 6, and so on. I've been trying unsuccessfully to write a loop, and since I'm a beginner in R, it has turned out to be harder than I thought. I want to simplify this because I need to do the same thing for a couple more data frames, and a function that could do it would make that much easier and less time-consuming. It would be wonderful if anyone could help me.

df <- ZEB1_23genes # Store ZEB1_23genes as df (data frame)

for (i in colnames(df)){      # Check the class of the variables
  print(class(df[[i]]))
}

print(df)

# Correlate ZEB1 with each of the 23 genes using Pearson's method


cor.test(df$ZEB1, df$PITPNC1, method = "pearson")
### OR ###
cor.test(df[,4], df[,5])

So I can correlate the columns one pair at a time, but I cannot write a loop that keeps column 4 fixed and correlates it with each subsequent column (5, 6, ..., 27).

Thank you!

  • Please take a look at [How to make a great R reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), to modify your question, with a smaller sample taken from your data (check `?dput()`). Posting images of your data or no data makes it difficult to impossible for us to help you! – massisenergy Mar 21 '20 at 16:58
  • Done (I think). Thanks! – Nuno Ramalho Mar 21 '20 at 17:28
  • What you've provided still isn't reproducible: you don't provide a copy of the data. The preceding comment points you to the relevant information for how to do this. – inhuretnakht Mar 21 '20 at 18:25

1 Answer


If I've understood your question correctly, the solution below should work well.

#Sample data
df <- data.frame(matrix(data = sample(runif(100000), 4293), nrow = 159, ncol = 27))

#Correlation function
#Takes a data.frame containing the columns to be correlated as input
#The column against which the other columns are correlated can be specified (start_col; default is 4)
#The last column to be correlated against start_col can also be specified (end_col; default is 0, meaning all columns after start_col)
#Function returns a data.frame with one row per correlated pair, containing start_col, end_col, and the correlation value.

my_correlator <- function(mydf, start_col = 4, end_col = 0){
  if(end_col == 0){
    end_col <- ncol(mydf)
  }
  #out_corr_df <- data.frame(start_col = c(), end_col = c(), corr_val = c())
  out_corr <- list()
  for(i in (start_col+1):end_col){
    out_corr[[i]] <- data.frame(start_col = start_col, end_col = i, corr_val = as.numeric(cor.test(mydf[, start_col], mydf[, i])$estimate))
  }
  return(do.call("rbind", out_corr))
}

test_run <- my_correlator(df, 4)

head(test_run)

#   start_col end_col     corr_val
# 1         4       5 -0.027508521
# 2         4       6  0.100414199
# 3         4       7  0.036648608
# 4         4       8 -0.050845418
# 5         4       9 -0.003625019
# 6         4      10 -0.058172227

The function takes a data.frame as input and returns another data.frame containing the correlations between a given column of the original data.frame and all subsequent columns. I do not know the structure of your data, and the function will fail if it runs into unexpected conditions (for instance, a non-numeric column such as one containing characters).
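If non-numeric columns are a concern, one possible guard is to skip them before calling `cor.test`. The sketch below is only an illustration under that assumption: `my_correlator_safe`, the `p_val` column, and the `demo_df` data are names made up here, not part of the function above, and reporting column names instead of indices is just a design choice.

```r
# Hypothetical variant of my_correlator: skips non-numeric columns
# and also keeps the p-value reported by cor.test.
my_correlator_safe <- function(mydf, start_col = 4, end_col = ncol(mydf)) {
  stopifnot(is.numeric(mydf[[start_col]]), start_col < end_col)
  cols <- (start_col + 1):end_col
  # Keep only the columns that cor.test can actually handle
  cols <- cols[vapply(mydf[cols], is.numeric, logical(1))]
  res <- lapply(cols, function(i) {
    ct <- cor.test(mydf[[start_col]], mydf[[i]], method = "pearson")
    data.frame(start_col = names(mydf)[start_col],
               end_col   = names(mydf)[i],
               corr_val  = unname(ct$estimate),
               p_val     = ct$p.value)
  })
  do.call("rbind", res)
}

# Made-up data with one character column (X10) to show that it is skipped
set.seed(1)
demo_df <- data.frame(matrix(runif(159 * 27), nrow = 159, ncol = 27))
demo_df$X10 <- as.character(demo_df$X10)
safe_run <- my_correlator_safe(demo_df, 4)
head(safe_run)
```

Silently skipping columns may or may not be what you want; replacing the filter with a `stop()` call would make the problem visible instead.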

Dunois
  • @Parfait thank you for noting that. I've changed the code, and it uses `do.call("rbind"...)` now instead. – Dunois Mar 21 '20 at 19:46
  • Or use `lapply` and avoid the bookkeeping of initializing a list and appending to it: `out_corr <- lapply((start_col+1):end_col, function(i) data.frame(start_col = start_col, end_col = i, corr_val = as.numeric(cor.test(mydf[, start_col], mydf[, i])$estimate)))`. – Parfait Mar 21 '20 at 22:31
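
Since the question mentions repeating this for a couple more data frames, the `lapply` idea can be applied one level up as well. This is a rough sketch, not a definitive implementation: `correlate_from` is a hypothetical name for a compact correlator built on the `lapply` suggestion, and the two data frames in `dfs` are random stand-ins for the real data.

```r
# Compact correlator built on the lapply suggestion (hypothetical name)
correlate_from <- function(mydf, start_col = 4, end_col = ncol(mydf)) {
  out_corr <- lapply((start_col + 1):end_col, function(i) {
    data.frame(start_col = start_col,
               end_col   = i,
               corr_val  = as.numeric(cor.test(mydf[, start_col], mydf[, i])$estimate))
  })
  do.call("rbind", out_corr)
}

# Two made-up data frames standing in for the real ones
set.seed(42)
dfs <- list(
  first  = data.frame(matrix(runif(159 * 27), nrow = 159)),
  second = data.frame(matrix(runif(159 * 27), nrow = 159))
)

# One result table per data frame, kept in a named list
all_results <- lapply(dfs, correlate_from, start_col = 4)
head(all_results$first)
```

Keeping the data frames in a named list means adding another one later only requires adding an element to `dfs`.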