2

I have this dataframe in r (link) (Example rows and columns below)

FocalID     Mother   Adelaide Asimov Austen Brazzaville Lusaka Kinshasa
Adelaide    HalfEar    0      0.0380 0.0417      0.0366 0.0278   0.0385
Asimov      Lusaka     0.0380 0      0.0845      0.0357 0.169    0.0641
Austen      Kinshasa   0.0417 0.0845 0           0.0526 0.0952   0.0411
Brazzaville NA         0.0366 0.0357 0.0526      0      0.0395   0.0488

I would like to add a new variable, df$cor, in which the value in each row is the result of a correlation. The correlation should be between two vectors: (1) the column whose colname corresponds to the value of the variable df$FocalID in that row, and (2) the column whose colname corresponds to the value of the variable df$Mother in that row.

If the vector correspondent to the column that matches the mother's name is absent (either because the mother is not known (NA in df$Mother) or absent from colnames), the correlation should produce an NA.

I have tried the following code:

df$cor <- cor(df[, colnames(df) %in% df$FocalID], df[, colnames(df) %in% df$Mother])

However, the result doesn't seem right. Any idea?

aynber
  • 22,380
  • 8
  • 50
  • 63

1 Answers1

1

If we need to do this for each pairwise column, we check whether the 'FocalID', 'Mother' columns are non-NA with complete.cases. Then, loop over the columns specifying subsetting only the non-NA columns, with apply and MARGIN = 1, do a check for whether those elements are %in% the column names of the dataset, select the data, apply cor and create the new column Cor

i1 <-  complete.cases(df[1:2])
df$Cor <- NA_real_
df$Cor[i1] <-  apply(df[i1, 1:2], 1, function(x) 
   if(all(x %in% names(df))) cor(df[, x[1]], df[, x[2]]) else NA)

-output

df$Cor
#[1]          NA  0.09769710  0.26956397          NA  0.04820137 -0.07776837          NA  0.19553956 -0.09596063          NA  0.04806345
#[12]  0.66489746          NA          NA          NA -0.04254666 -0.05975570  0.47359966  0.09745244          NA          NA  0.24750130
#[23]          NA          NA          NA          NA          NA          NA          NA          NA  0.10822526          NA  0.07093166
#[34]          NA          NA -0.18088278 -0.17548394  0.11585058  0.07278060  0.36327624  0.10178624          NA          NA          NA
#[45]  0.20491334          NA

Or using rowwise from dplyr

library(dplyr)
df <- df %>%
        rowwise %>% 
        mutate(Cor = if(!is.na(FocalID) & !is.na(Mother) & 
           all(c(FocalID, Mother) %in% names(.)))
          cor(df[[FocalID]], df[[Mother]]) else NA_real_)

data

library(readxl)
df <- read_excel(file.choose(), na = "NA")
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 1
    Thank you! That's great! I find the second (dplyr) solution better because it is not dependent on knowing the numbers of the columns of interest. Is it also possible to 'change' the vectors to be correlated by removing the values resulting from the intersection of FocalID with itself and FocalID with its Mother? – Vittoria Roatti Dec 28 '20 at 15:59
  • @VittoriaRoatti Did you meant `df %>% filter(FocalID != Mother) %>% rowwise %>%..` – akrun Dec 28 '20 at 16:16
  • I am not sure. I mean that, to avoid including the individuals whose networks are involved in the correlation, the extracted vectors (e.g. for the second row in the above example: column "Asimov" for FocalID and column "Lusaka" for Mother) should miss the values that, in the dataframe, result from the interaction of the vector IDs with themselves and with each other (they should each have two values less: removing the values Asimov with Asimov and Asimov with Lusaka in the vector "Asimov"; removing the values Lusaka with Lusaka and Lusaka with Asimov in the vector "Lusaka"). – Vittoria Roatti Dec 28 '20 at 17:13
  • @VittoriaRoatti In that case, you may need `df %>% filter(FocalID != Mother, !duplicated(paste(pmin(FocalID, Mother), pmax(FocalID, Mother))))` – akrun Dec 28 '20 at 17:15
  • Thank you! I will try it now. It should be added before %>% rowwise %>% ..., right? – Vittoria Roatti Dec 28 '20 at 17:35
  • @VittoriaRoatti Yes, the `filter`ing comes before the `rowwise` – akrun Dec 28 '20 at 17:35