
I want to correlate some variables while taking the sample origin into account, but I don't have the same number of rows for each subset. In the title I call it unpaired data because some types of samples have more observations of a certain category than others. This issue could be 'solved' by taking the mean of the category per type of sample, but it feels like I would be throwing information away, so what is the correct way to pair this data?

For example, imagine I am studying 3 types of soil, collect my materials from 4 locations, and the sample analysis results in 3 variables. I am interested in understanding how each variable in each soil type is associated with the others (for example, an increase in var1 in soil type1 results in a decrease in var2 in soil type3). The problem is that the number of samples per soil type and location is not the same; in some cases there are locations without a specific type of soil, so correlating the variables as in the example above is not possible (or am I missing something here?). As said above, if I just take the mean of each soil type for each location, the data would then be paired and I could remove rows that lack information on a given soil type for a given comparison. However, using the mean sounds like throwing information away. How should this problem be addressed? Consider the following sample data (df) for this discussion:

df <- structure(list(location = c("A", "A", "A", "A", "B", "B", "B","B", "C", "C", "C", "C", "C", "C", "D", "D"), 
                     type_of_soil = c("type1", "type1", "type2", "type3", "type1", "type2", "type2", "type3", "type1", "type2", "type2", "type2", "type3", "type3", "type1", "type2"), 
                     var1 = c(29, 7, 11, 20, 12, 15, 22, 27, 29, 32, 31, 23, 22, 20, 20, 15), 
                     var2 = c(56, 92, 111, 15, 230, 23, 47, 50, 190, 10, 6, 2, 15, 73, 223, 50), 
                     var3 = c(1000, 1643, 1242, 6352, 111, 3552, 837, 5612, 999, 572, 127, 593, 3232, 524, 41, 2)), 
                class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -16L))
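
For reference, a quick cross-tabulation of df (just a sanity check, not part of the workflow below) makes the imbalance concrete: the counts per soil type differ across locations, and location D has no type3 sample at all.

table(df$location, df$type_of_soil)
#expected counts (computed from df above):
#    type1 type2 type3
#  A     2     1     1
#  B     1     2     1
#  C     1     3     2
#  D     1     1     0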

First I reshape the data so that each combination of variable and soil type is treated as one variable. This takes the data frame from 6 columns (3 variables of interest) to 11 columns (9 variables of interest). As a result, every row contains NAs, and so the use = "pairwise.complete.obs" argument of the cor function cannot be used anymore.

#packages used here and in the function below
library(tidyr)    #pivot_longer(), unite(), pivot_wider()
library(dplyr)    #inner_join(), %>%
library(corrplot) #corrplot()

df$ID <- 1:nrow(df) #one ID for each original sample
df_longer <- pivot_longer(data = df, cols = -c(1,2,6), names_to = "var", values_to = "vals")
df_cor <- unite(df_longer, var, type_of_soil, sep = '_', col = 'var_type')
df_wider <- pivot_wider(df_cor, names_from = var_type, values_from = vals)
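
A quick check (again, not part of the workflow itself) confirms the problem: every row of df_wider comes from a single soil type, so the six columns belonging to the other two soil types are NA in every row.

dim(df_wider)                      #should be 16 x 11 (location, ID, 9 var_type columns)
all(rowSums(is.na(df_wider)) > 0)  #should be TRUE - no complete cases are left, and any pair of
                                   #columns from two different soil types shares no non-NA rows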

In order to address this I wrote a function that receives df_wider and two strings containing soil types ('type1', 'type2', or 'type3'), and then:

  1. Subsets the data into two datasets (each containing the variables from a specific soil type);
  2. Keeps in each dataset only the locations that are present in both of the new datasets;
  3. Inner-joins (inner_join()) the datasets, pairing the observations in every possible combination;
  4. Re-splits the result into one dataset per soil type (so as to correlate only x vars with y vars);
  5. Plots a corrplot() of the resulting correlation matrix.

bicorplot <- function(df_forcor, soil1, soil2){
  
  #cols that reference the soil of interest
  coord1 <- grep(soil1, colnames(df_forcor))
  coord2 <- grep(soil2, colnames(df_forcor))
  
  #remove from each subsetted df the samples that don't have that type of soil
  df1 <- na.omit(df_forcor[,c(1, coord1)])
  df2 <- na.omit(df_forcor[,c(1, coord2)])
  
  #retain only samples from locations that have both types of soil
  coord_eq1 <- df1[df1$location %in% df2$location,]
  coord_eq2 <- df2[df2$location %in% df1$location,]
  
  #inner_join because some locations have a different number of samples for each type of soil;
  #this results in duplication of observations in some cases
  temp_df <- inner_join(coord_eq1, coord_eq2, by = "location")
  temp_df1 <- temp_df[1:4]        #location + the three soil1 variables
  temp_df2 <- temp_df[c(1,5:7)]   #location + the three soil2 variables
  
  cor_mat <- cor(x = temp_df1[2:4], y = temp_df2[2:4], method = "spearman")
  
  corrplot(cor_mat)
}

The above code works, but step 3 duplicates some observations and probably introduces a bias. If instead I use the average values, there is no need for the biased pairing via inner_join, and the correlation result is very different.

df_mean <- df_wider %>%
  group_by(location) %>%
  summarise_all(mean, na.rm = T)

bicorplot(df_wider, 'type1', 'type3')
bicorplot(df_mean, 'type1', 'type3')

This is the first output, from bicorplot(df_wider, 'type1', 'type3'): [corrplot image "corr_bias"]

And this is the second output, using the mean values: [corrplot image "corr_mean"]
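
To see how much duplication step 3 introduces for this comparison, here is a quick count (not part of the workflow above) of the type1/type3 samples per location:

with(df, table(location, type_of_soil))[, c("type1", "type3")]
#each location contributes (n type1) x (n type3) rows to temp_df:
#A: 2 x 1 = 2, B: 1 x 1 = 1, C: 1 x 2 = 2, D: 1 x 0 = 0 (dropped),
#so 5 pairs are built from 8 samples, with A's type3 and C's type1 observations each used twice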

So, is either of these methods correct? If not, why are they incorrect, and what should the approach be in this case?


  • Haven't completely thought this through - but at a glance, this looks like a task for a linear mixed effects model ...? – Where's my towel Apr 17 '23 at 09:22
  • To get better, faster help I would emphasize the *minimal* in [minimal reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) - this is a lot to parse through to find a specific **coding** question. Try editing down to the smallest amount of information/code needed. Just trying to help you get the help you need! – jpsmith Apr 17 '23 at 13:32
  • +1. It also seems like this is more of a stats question, so maybe better try [stats.stackexchange](https://stats.stackexchange.com/) – Where's my towel Apr 17 '23 at 13:50
  • You guys are probably right, I will ask there! Thanks! – Igor Salerno Filgueiras Apr 17 '23 at 18:58
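
(As a minimal sketch of the mixed-model idea from the first comment, assuming the lme4 package: something like the following treats location as a random grouping factor for a single variable. It is not a worked-out answer to the cross-soil-type correlation question, which would need a multivariate extension.)

library(lme4)

#e.g. does var1 differ by soil type once location is treated as a random intercept?
m_var1 <- lmer(var1 ~ type_of_soil + (1 | location), data = df)
summary(m_var1)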

0 Answers