0

Assuming I have a data frame like the below (actual data frame has million observations). I am trying to look for correlation between signal column and other net returns columns group by various values of signal_up column.

I have tried “dplyr” library and combination of functions “group_by” and “summarize”. However, I am only able to get correlation between two columns and not the multiple columns.

library(dplyr) 
df %>%   
   group_by(Signal_Up) %>%   
   summarize (COR=cor(signal, Net_return_at_t_plus1))

Data and desired result are given below.

Data

enter image description here

Desired Result

Correlation between "signal" Vs ["Net_return_at_t_plus1", "Net_return_at_t_plus5", "Net_return_at_t_plus10"]

Group by "Signal_Up"

enter image description here

dc37
  • 15,840
  • 4
  • 15
  • 32
ClaudeTi
  • 13
  • 1
  • 5

1 Answers1

1

Maybe you can try to use summarise_at to perform the correlation over several columns.

Here, I took the iris dataset as example:

library(dplyr)
iris %>% group_by(Species) %>%
  summarise_at(vars(Sepal.Length:Petal.Length), ~cor(Petal.Width,.))

# A tibble: 3 x 4
  Species    Sepal.Length Sepal.Width Petal.Length
  <fct>             <dbl>       <dbl>        <dbl>
1 setosa            0.278       0.233        0.332
2 versicolor        0.546       0.664        0.787
3 virginica         0.281       0.538        0.322

For your dataset, you should try something like:

library(dplyr)
df %>% group_by(Signal_Up) %>%
  summarise_at(vars(Net_return_at_t_plus1:Net_return_at_t_plus1), ~cor(signal,.))

Does it answer your question ?


NB: It is easier for people to try to solve your issue if you are providing reproducible example that they can easily copy/paste instead of adding it as an image (see: How to make a great R reproducible example)

dc37
  • 15,840
  • 4
  • 15
  • 32
  • Thanks very much @dc37 for the solution. It works perfectly fine. I will provide reproducible example instead of the image next time. Thanks for pointing it. – ClaudeTi Mar 23 '20 at 16:20
  • Thanks @dc37, its done. Just want to ask whether it’s possible to create the categories/ groups on the fly for the variable "signal"? Currently, I had to create a variable "Signal_Up" (by round up values from "signal" variable) in order to categorize the main variable "signal" (categories: -4, -3, … , 3, 4, 5,..). Something like - Groupby "signal" by [A, B, C] WHERE A = signal > 1 , B = -1 > signal > 1, C = signal < -1. – ClaudeTi Mar 24 '20 at 13:46
  • Sorry, your question is not really clear. And without a reproducible example, it will be hard to address your issue – dc37 Mar 24 '20 at 16:06
  • Thanks, I will post that as a separate question altogether. – ClaudeTi Mar 30 '20 at 16:56