1

I have a dataset with 450,000 columns and 660 rows. The first 330 rows belong to group "A" and the last 330 rows to group "B". I would like to calculate, for each column, the correlation between group A and group B.

So far I have managed this:

setkey(df, group)
cor(df["A"]$value, df["B"]$value)

This returns the correlation between the two groups for the first column.

However, I want to do this for all 450,000 columns and collect the results in a new data frame that holds each column name together with the correlation between the two groups.

Furthermore, I have to take into account that the first row of group A (row 1) is related to the first row of group B (row 331), the second of group A with the second of group B (row 2 and row 332) and so on.

Does anyone here have an idea how to achieve this in R?

Thank you all.

Florian
Silv

3 Answers

2
# sample data
df = data.frame(a=runif(660,1,10),b=runif(660,1,10),c=runif(660,1,10))

data.frame(corr=sapply(df,function(x) {cor(x[1:330],x[331:nrow(df)])}))

Output:

         corr
a -0.05902668
b  0.03443904
c -0.09899892
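
If the result should carry the column name as its own column, as the question asks, one possible variant is sketched below. It assumes the same layout as above (first half of the rows is group A, second half is group B); the names `n_pairs`, `corr` and `result` are illustrative only, and the sample data are random.

```r
# Sample data: 660 rows, first 330 = group A, last 330 = group B (assumed layout)
set.seed(1)
df <- data.frame(a = runif(660, 1, 10), b = runif(660, 1, 10), c = runif(660, 1, 10))

n_pairs <- nrow(df) / 2  # 330 paired observations per group

# vapply checks that each column yields exactly one numeric value
corr <- vapply(
  df,
  function(x) cor(x[1:n_pairs], x[(n_pairs + 1):(2 * n_pairs)]),
  numeric(1)
)

# One row per original column: its name and its A-vs-B correlation
result <- data.frame(
  column = names(corr),
  correlation = unname(corr),
  stringsAsFactors = FALSE
)
result
```

`vapply` is used here instead of `sapply` only because it fails loudly if a column does not produce a single number; the correlations themselves are computed the same way.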
Florian
  • Thanks, I modified my answer, and I'll try and find more info on why that is considered bad practice. – Florian Jul 20 '17 at 13:49
  • Basically `apply` converts your data frame to matrix first. You can find a lot of info [here](https://stackoverflow.com/questions/3505701/r-grouping-functions-sapply-vs-lapply-vs-apply-vs-tapply-vs-by-vs-aggrega) – Sotos Jul 20 '17 at 13:59
2

Here is a purrr solution. map_df returns a data frame.

Sample data:

df<-data.frame(a1=rnorm(660,50,20),a2=rnorm(660,50,20))

The correlation between the two groups (A and B) for every column:

library(purrr)
map_df(df, ~{cor(.[1:330],.[331:660])})

Returns

# A tibble: 1 × 2
#           a1           a2
#        <dbl>        <dbl>
#1 -0.09949217 -0.008308669
P.R
0

Try looping over all columns.

df<-data.frame(a1=rnorm(660),a2=rnorm(660))
cordf<-numeric()
for(i in 1:ncol(df)){cordf[i]<-cor(df[1:330,i],df[331:660,i])}
names(cordf)<-names(df) 

cordf contains the correlations between the first and last 330 rows and is named after the original variable names in the data frame.
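
With 450,000 columns it may be worth avoiding a per-column call to `cor` altogether. The following is only a sketch of one alternative, not something from the original answer: it computes the Pearson correlation for all columns at once from centered matrices, and assumes every column is numeric with no missing values. The names `A`, `B` and `cor_vec` are illustrative. Note that it holds several full-size copies of the data in memory, which could be a problem for a data set of the size described in the question.

```r
set.seed(1)
df <- data.frame(a1 = rnorm(660), a2 = rnorm(660))

m <- as.matrix(df)

# Center each column within its group (scale = FALSE: subtract the mean only)
A <- scale(m[1:330, , drop = FALSE], center = TRUE, scale = FALSE)
B <- scale(m[331:660, , drop = FALSE], center = TRUE, scale = FALSE)

# Pearson r per column: sum of cross-products over the product of the norms
cor_vec <- colSums(A * B) / sqrt(colSums(A^2) * colSums(B^2))

# Cross-check against the column-by-column computation
loop_cor <- vapply(df, function(x) cor(x[1:330], x[331:660]), numeric(1))
all.equal(cor_vec, loop_cor)
```

Whether this is actually faster than the loop for a given data set would need to be measured.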

Alex2006
  • If you're going to use a `for` loop, you should pre-allocate the memory needed for `cordf`, i.e. `cordf <- numeric(ncol(df))` – bouncyball Jul 20 '17 at 14:13
  • Good suggestion. Thanks. – Alex2006 Jul 20 '17 at 14:41
  • Thank you for the answer @alex2006 and the additional suggestion bouncyball. I managed to get the correlations but I am not sure if this syntax also takes into account that row 1 and 331, 2 and 332.... are related to each other. Or does it? – Silv Jul 21 '17 at 09:25
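
On the pairing question in the last comment: `cor(x, y)` matches its two arguments element by element, so `cor(df[1:330, i], df[331:660, i])` pairs row 1 with row 331, row 2 with row 332, and so on, provided the rows are stored in that order. A small check with made-up data (the vectors `a`, `b` and `x` are hypothetical) illustrates this: shuffling one half breaks the pairing and the correlation collapses.

```r
set.seed(42)
a <- rnorm(330)
b <- a + rnorm(330, sd = 0.1)  # b[i] is built from a[i], so the pairs are strongly related
x <- c(a, b)                   # rows 1..330 = group A, rows 331..660 = group B

# Positional pairing: element 1 with 331, 2 with 332, ...
paired <- cor(x[1:330], x[331:660])

# Same values, but the second half is put in random order, destroying the pairing
shuffled <- cor(x[1:330], sample(x[331:660]))

paired    # close to 1
shuffled  # close to 0
```

So the row order matters: if the rows of group B were sorted differently from those of group A, they would need to be reordered before calling `cor`.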