How would I automate computing correlations within a tibble for various countries and store effectively?

Question

I'm somewhat of a beginner in R, and I am working on a relatively large dataset (for me at least) of around 500,000 rows.

I am trying to find the correlation between variables for various countries (measuring the effects of bullying specifically) in the PISA dataset (an education-based survey).

I am able to compute the correlation matrix for countries on a case by case basis.

I wanted to record the correlation between two variables (so not the entire matrix necessarily) across all these countries - automating this and storing the results all in a tibble so that I don’t need to spend time doing this manually.

correl_countries = tibble()

for (each in list_countries){
  countries_bullying %>% #tibble subset of the original data 
    filter(CNTRYID == each)%>%
    select(reading_score, bullied_index)%>%
    correl = cor(use = "pairwise.complete.obs") #something to store the correlation values
    correl_countries %>% add_row(x = each, y = correl) #wanted to add these results to a tibble
}

Currently nothing seems to happen and I receive this error.

Error in is.data.frame(x) : argument "x" is missing, with no default

It may have something to do with the fact that "pairwise.complete.obs" generates a correlation matrix and not a single vector.

Grateful for your recommendations!

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. We don't want your actual data, just something representative that we can use for testing. — MrFlick, Dec 19 '20 at 02:04
You could also try the `RALSA` package which is designed for large-scale assessment data and has a graphical user interface: https://cran.r-project.org/package=RALSA For guides on how to use it, see here: http://ralsa.ineri.org/user-guide/ — panman, Nov 04 '21 at 14:25

score 2 · Accepted Answer · answered Dec 19 '20 at 09:41

You don't really need the loop here; the tidyverse has you covered. The following returns a tibble with two columns, CNTRYID and correl, with one row per country:

library(tidyverse)

# get only the correlations
countries_bullying %>%
  group_by(CNTRYID) %>%
  summarise(correl = cor(reading_score, bullied_index, use = "pairwise.complete.obs"))
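To try the grouped approach without the PISA data, here is a minimal sketch on an invented tibble. The name `toy`, its values, and the extra `n_pairs` column are assumptions for illustration only; the column names CNTRYID, reading_score and bullied_index follow the question.

```r
library(dplyr)

# Toy stand-in for countries_bullying; all values are invented
set.seed(1)
toy <- tibble(
  CNTRYID       = rep(c("A", "B", "C"), each = 20),
  bullied_index = rnorm(60),
  reading_score = 500 - 20 * bullied_index + rnorm(60, sd = 30)
)
toy$reading_score[c(3, 25)] <- NA  # simulate missing responses

# One row per country: the correlation and the number of complete pairs behind it
correl_countries <- toy %>%
  group_by(CNTRYID) %>%
  summarise(
    n_pairs = sum(!is.na(reading_score) & !is.na(bullied_index)),
    correl  = cor(reading_score, bullied_index, use = "pairwise.complete.obs")
  )

correl_countries
```

Passing the two columns to `cor()` directly yields a single number per group rather than a 2x2 matrix, which is what lets `summarise()` store it in one column. The error in the question suggests that `cor()` was called with no data at all: the line `correl = cor(use = "pairwise.complete.obs")` calls `cor()` on its own instead of receiving the piped columns.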

score 1 · Answer 2 · answered Dec 19 '20 at 08:29

New user here, so I can't post comments yet. If I understood correctly, you want to compute the correlation between two variables per country and store the results in a separate tibble. Replace "df" with the name of your dataset, "countries" with the column that identifies the country, and "var1" and "var2" with the two variables you want to correlate. For large datasets, a more elegant solution is likely available (e.g. subsetting fewer columns on each iteration).

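library(tibble)

vec <- unique(df$countries)               # one entry per country
correl_countries <- numeric(length(vec))  # preallocate the result vector
for (i in seq_along(vec)) {
    new <- df[df$countries == vec[i], ]   # rows for the current country
    # pairwise.complete.obs ignores rows where either variable is NA
    correl_countries[i] <- cor(new$var1, new$var2, use = "pairwise.complete.obs")
}
tibble(vec, correl_countries)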
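One possible way to avoid re-filtering the full data frame on every iteration is to split it once and apply `cor()` to each piece. This is only a sketch: the data below are invented, and `df`, `countries`, `var1` and `var2` are the same placeholder names used above.

```r
library(tibble)

# Hypothetical stand-in for df; values are invented
set.seed(42)
df <- data.frame(
  countries = rep(c("X", "Y"), each = 15),
  var1 = rnorm(30),
  var2 = rnorm(30)
)

# Split once into one data frame per country, keeping only the two columns needed
pieces <- split(df[, c("var1", "var2")], df$countries)

# Apply cor() to each piece; vapply() guarantees one numeric value per country
correl <- vapply(
  pieces,
  function(d) cor(d$var1, d$var2, use = "pairwise.complete.obs"),
  numeric(1)
)

result <- tibble(countries = names(correl), correl = unname(correl))
result
```

Whether this is noticeably faster than the loop on 500,000 rows would need to be measured on the real data.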
