I want to correlate some variables taking into account the sample origin, but don't have the same amount of rows for each subset. In the title I say it is unpaired data because there are types of samples with more observations of a certain category than others. This issue could be 'solved' by taking the mean of the category per type of sample but it feels that like that I am throwing information away, so what is the correct way to pair this data?
For example, imagine I am studying 3 types of soil, collect my materials from 4 locations, and the sample analysis results in 3 variables. I am interested in understanding how each var in each soil are associated with one another (for example increase in var1 in soil type1 results in a decrease in var2 in soil type3). The problem is that the number of samples for soil and location are not the same, in some cases there are locations without a specific type of soil so correlating the variables as in the example above is not possible (or am I missing something here?). As said above, if I just take the mean of each soil for each location, the data would then be paired and I would be able to remove rows that don't have info of a given soil for a given comparison. However, using the mean sounds like throwing information away. How should this problem be addressed? Consider the following sample data (df) for this discussion:
df <- structure(list(location = c("A", "A", "A", "A", "B", "B", "B","B", "C", "C", "C", "C", "C", "C", "D", "D"),
type_of_soil = c("type1", "type1", "type2", "type3", "type1", "type2", "type2", "type3", "type1", "type2", "type2", "type2", "type3", "type3", "type1", "type2"),
var1 = c(29, 7, 11, 20, 12, 15, 22, 27, 29, 32, 31, 23, 22, 20, 20, 15),
var2 = c(56, 92, 111, 15, 230, 23, 47, 50, 190, 10, 6, 2, 15, 73, 223, 50),
var3 = c(1000, 1643, 1242, 6352, 111, 3552, 837, 5612, 999, 572, 127, 593, 3232, 524, 41, 2)),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -16L))
First I manipulate the data so that each combination of var and soil are considered one variable. This results in the dataframe going from 6 columns (3 vars of interest) to 11 columns (9 vars of interest). This will result in every row containing NAs
and as such the argument use = "pairwise.complete.obs"
of the cor
function cannot be used anymore.
df$ID <- 1:nrow(df) #one ID for each original sample
df_longer <- pivot_longer(data = df, cols = -c(1,2,6) , names_to = "var", values_to = "vals")
df_cor <- unite(df_longer, var, type_of_soil, sep = '_', col = 'var_type')
df_wider <- pivot_wider(df_cor, names_from = var_type, values_from = vals)
In order to address this I wrote a function that receives df_wider, two strings containing soil types ('type1', 'type2', or 'type3'), and then:
- Subsets the data into two datasets (each containing the variables from a specific soil type);
- Keeps in each dataset only locations that are present in both of the new datasets;
- Inner_joins the datasets, pairing the observations in every combination possible;
- Re-splits the datasets into each type of soil again (so as to correlate only x.vars with y.vars);
- Plots a
corrplot()
of both datasets
bicorplot <- function(df_forcor, soil1, soil2){
#cols that reference the soil of interest
coord1 <- grep(soil1, colnames(df_forcor))
coord2 <- grep(soil2, colnames(df_forcor))
#remove from each subseted df the samples that don't have that type of soil
df1 <- na.omit(df_forcor[,c(1, coord1)])
df2 <- na.omit(df_forcor[,c(1, coord2)])
#retain only samples that have both type_of_soil
coord_eq1 <- df1[df1$location %in% df2$location,]
coord_eq2 <- df2[df2$location %in% df1$location,]
#inner_joined due to some locations having different number of samples for each type of soil
#this results in duplication in some cases
temp_df <- inner_join(coord_eq1, coord_eq2, by = "location")
temp_df1 <- temp_df[1:4]
temp_df2 <- temp_df[c(1,5:7)]
cor_mat <- cor(x = temp_df1[2:4], y = temp_df2[2:4], method = "spearman")
corrplot(cor_mat)
}
The above code works perfectly, however step 3 results in duplication of observations and probably in a bias. If instead I use the average values, there is no need for the biased pairing using inner_join
and the correlation result is very different.
df_mean <- df_wider %>%
group_by(location) %>%
summarise_all(mean, na.rm = T)
bicorplot(df_wider, 'type1', 'type3')
bicorplot(df_mean, 'type1', 'type3')
And this is the second output (using the mean values):
So, is any of these methods correct? If not, what should the approach be in this case and why are they incorrect?