I want to calculate a Pearson correlation between several columns. The solution JasonAizkalns posted in this thread is very useful for me.

  library(dplyr)
  library(corrr)  # provides correlate()

  df %>%
    select_if(is.numeric) %>%
    group_by(year) %>%
    group_map(~ correlate(.x))

Now I'm wondering two things:

  1. How can I get p-values?
  2. Why are some correlation coefficients marked in red? I have not found anything about it in the documentation. Are these already the significant correlations? If yes, which significance level is used?

I am looking for the simplest possible extension, without having to switch to a completely different method.

Thanks for any tips!

Edit 1 (11/28/22): My grouping variable ("trainingsmodus") is a character variable, and I get the following error message, so I have adapted my code.

Error in group_by():
! Must group by variables found in .data.
✖ Column trainingsmodus is not found.
Backtrace:
  1. ... %>% ...
  2. dplyr:::group_by.data.frame(., trainingsmodus)

My adapted code:

df %>%
  select_if(is.character) %>%
  group_by(year) %>%
  group_map(~ correlate(.x)) %>%
  add_column(year)

Even if I create the grouping variable as a numeric variable, the results of both groups are exactly identical, and this makes no sense. Does anyone have a tip on how I can correct the code?

Edit 2 (11/28/22): Repro of my df and the code:

df <- data.frame(year = c("lorem", "ipsum", "lorem", "ipsum"),    
             var1 = 4:7,
             var2 = 5:8,
             var3 = 6:9,
             var4 = 7:10)

library(rstatix)

df %>%
      select_if(is.character) %>%
      group_by(year) %>%
      group_map(~ cor_test(df,
                vars = c("var1", "var2", "var3", "var4"), 
                vars2 = c("var1", "var2", "var3", "var4") %>%
      filter(is.finite(statistic))) 
 
Phil
formatc
  • [See here](https://stackoverflow.com/q/5963269/5325862) on making a reproducible example that is easier for folks to help with. The second part of your question (about a column not being found) deals with data none of us can see, but it seems like now you're trying to get correlations of character columns – camille Nov 28 '22 at 15:25
  • Re: EDIT - your error is due to you calling a variable that doesn't exist in your data frame. This is outside of the scope of your initial question. – Phil Nov 28 '22 at 15:31
  • I edited my initial post, a repro is now available. Sorry for not doing that in the beginning. – formatc Nov 28 '22 at 18:19
  • Your edited code doesn't make sense. You are only keeping character variables (in this case, it's just year), and then group by that year variable, leaving you with nothing else. What is the purpose? – Phil Nov 28 '22 at 19:37
  • I have an experimental group and a control group. The grouping variable is (as in the example) a character variable. Now I want to calculate the correlations of 20 items a.) for the whole sample (is not the topic here) as well as b.) the correlations for each of the two experimental conditions individually. I am interested in which items are correlated differently in the two groups. – formatc Nov 29 '22 at 06:46
  • Remove the `select()` line if you don't want to remove any variables. – Phil Nov 29 '22 at 15:16

1 Answer

How can I get p-values?

correlate() doesn't provide this information, so you'd need to use another tool. The rstatix package has a function, cor_test(), that can be used instead:

library(tibble)
library(dplyr)
library(rstatix)

df <- tribble(
  ~year, ~V1, ~V2, ~V3, ~misc_var,
  2018,   5,   6,   5,       "a",
  2018,   4,   6,   4,       "b",
  2018,   3,   2,   3,        NA,
  2013,   5,   8,   2,       "4",
  2013,   6,   3,   8,       "8",
  2013,   4,   7,   5,        NA
)

df |>
  select(where(is.numeric)) |>
  group_by(year) |>
  group_map(~ cor_test(.x, vars = c("V1", "V2", "V3"),
                       vars2 = c("V1", "V2", "V3")) |> 
              filter(is.finite(statistic)) |>
              add_column(.y))

[[1]]
# A tibble: 7 × 7
  var1  var2    cor    statistic             p method   year
  <chr> <chr> <dbl>        <dbl>         <dbl> <chr>   <dbl>
1 V1    V2    -0.76       -1.15  0.454         Pearson  2013
2 V1    V3     0.5         0.577 0.667         Pearson  2013
3 V2    V1    -0.76       -1.15  0.454         Pearson  2013
4 V2    V2     1    67108864     0.00000000949 Pearson  2013
5 V2    V3    -0.94       -2.89  0.212         Pearson  2013
6 V3    V1     0.5         0.577 0.667         Pearson  2013
7 V3    V2    -0.94       -2.89  0.212         Pearson  2013

[[2]]
# A tibble: 4 × 7
  var1  var2    cor statistic     p method   year
  <chr> <chr> <dbl>     <dbl> <dbl> <chr>   <dbl>
1 V1    V2     0.87      1.73 0.333 Pearson  2018
2 V2    V1     0.87      1.73 0.333 Pearson  2018
3 V2    V3     0.87      1.73 0.333 Pearson  2018
4 V3    V2     0.87      1.73 0.333 Pearson  2018
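The same pattern should carry over to a character grouping column like the one in the question's Edit 2. The sketch below is only an illustration under assumptions: the data frame `toy` and the column names `group`, `item1`, `item2` and `item3` are hypothetical, and each group has five rows because a Pearson test needs at least three complete observations per group (the two-row groups of the repro are too small). Three details differ from the question's attempt: there is no `select_if()` step, so the numeric columns survive; `cor_test()` receives `.x` (the current group) instead of the full `df`, which is what made both groups return identical results; and self-pairs are dropped by name rather than with `is.finite(statistic)`, since the output above shows a (V2, V2) pair slipping through with a huge but finite statistic.

```r
library(dplyr)
library(rstatix)

# Hypothetical toy data: character grouping column plus three numeric items
toy <- data.frame(
  group = rep(c("lorem", "ipsum"), each = 5),
  item1 = c(4, 5, 6, 7, 8, 3, 6, 4, 8, 5),
  item2 = c(5, 7, 6, 9, 8, 7, 5, 8, 4, 6),
  item3 = c(9, 6, 8, 5, 7, 2, 5, 3, 6, 9)
)

items <- c("item1", "item2", "item3")

res <- toy |>
  group_by(group) |>
  group_map(~ cor_test(.x, vars = items, vars2 = items) |>
              filter(var1 != var2) |>        # drop self-correlations by name
              mutate(group = .y$group)) |>   # label each row with its group key
  bind_rows()

res
```

Stacking the per-group results with `bind_rows()` gives one table in which rows for the two conditions can be compared side by side, instead of an unlabelled list.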

Why are some correlation coefficients marked in red?

By default, tibbles display negative or NA values in red to make them easier to notice.

Phil
  • Great, thanks! How do I know which group "[[1]]" stands for? P.S.: Unfortunately, I seem to have a problem with my grouping variable (see revised entry post). Can you please help again? – formatc Nov 28 '22 at 11:01
  • "How do I know which group "[[1]]" stands for?" `group_map` can use a function with two arguments, conventionally named `.x` and `.y`. `.x` contains the data for the current group, `.y` contains a single row containing the columns that define the current group. You can bind `.x` and `.y`. Alternatively, add `.keep = TRUE` to your call to `group_map`. It's all in the online doc, which is almost always a useful read. – Limey Nov 28 '22 at 11:22
  • I've edited the code to include the years in the output. – Phil Nov 28 '22 at 15:16
  • Limey, thanks for the reference to the documentation, which I have also read. Unfortunately, however, this is not a guarantee to understand a procedure or a problem and I am not getting ahead. Thank you, Phil, for editing. – formatc Nov 28 '22 at 18:17
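To make the `.x` / `.y` explanation from the comments concrete, here is a minimal sketch. It assumes hypothetical toy data (`toy`, `group`, `item1`, `item2` are made-up names) and uses base R's `cor.test()` for a single pair of columns, so only dplyr is needed. It is not the code from the answer, just an illustration of how the group key in `.y` can be bound to each result.

```r
library(dplyr)

# Hypothetical toy data: a character grouping column and two numeric items
toy <- data.frame(
  group = rep(c("control", "treatment"), each = 5),
  item1 = c(1, 2, 3, 4, 5, 2, 4, 1, 5, 3),
  item2 = c(2, 1, 4, 3, 5, 5, 1, 4, 2, 3)
)

labelled <- toy |>
  group_by(group) |>
  group_map(function(.x, .y) {
    # .x: rows of the current group, without the grouping column
    # .y: one-row tibble holding the group key (here: group)
    test <- cor.test(.x$item1, .x$item2)  # Pearson is the default method
    bind_cols(.y, data.frame(cor = unname(test$estimate), p = test$p.value))
  }) |>
  bind_rows()

labelled
```

Each list element returned by `group_map()` now carries its own group label, so the position (`[[1]]`, `[[2]]`) no longer has to be decoded by hand. With `.keep = TRUE`, the grouping column would instead stay inside `.x`.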