
I have looked everywhere for an answer to this. There are plenty of answers about stock correlations, but none that revolve around correlation with a single stock.

My goal is to find which 5 stocks had the highest correlation to a specific stock ID. The dataset is split weekly over 10 years.

I have tried splitting the datasets into two: 1) dataset without the stock I need 2) dataset with the specific stock.

I know I need to do a weekly comparison of all the stocks to this specific stock and somehow find which was most correlated over the timeframe.

It is roughly clear in my head; however, I am stuck on the implementation because I am not sure how to group the weeks and compare each group to a single week, and so on.

Ultimate goal: compare all the stocks of the first week to the first line of the second table and so on.

Many thanks!

PS: This isn't a university assignment, I am just trying to improve my coding skills

  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Pictures of data aren't very helpful because we can't copy/paste the data in to R. – MrFlick Feb 24 '20 at 17:24
  • Dear @MrFlick, Apologies, it is my first time posting here. I have attached some of the code I used to split up the data to get the stock I want. I hope this is what you were asking for. I am not sure how to give you a sample dataset; please let me know and I will do it right away. – Nima Ghaedi Feb 24 '20 at 17:44
  • One way to provide your data would be to paste the `dput()` output of your data set into your question: `dput(stocks)`. If you prefer not to share the full data set you can use `head()` to only share the first few rows: `dput(head(stocks, 20))`. – Till Feb 24 '20 at 17:49
  • @Till Thank you so much! I have re-edited my post to include the code I used and the `dput()` function you recommended. – Nima Ghaedi Feb 24 '20 at 17:53
  • You included the `dput()` command in your code, but what we need to be able to help you is its output. When you execute `dput(stocks)` you'll see output in the console starting with `structure(...`. Please copy and paste that output into your question, preferably in a separate code section. – Till Feb 24 '20 at 18:14

1 Answer


These are the data transformation steps to put your data set into a state where you can calculate correlations easily:

  1. Extract week and year from the `Date` column to create a week identifier that is unique across years (`week()` and `year()` are lubridate functions).
  2. Drop the `Date` column.
  3. Make the dataset wide, so that the value for each stock is in a separate column (`pivot_wider()` is a tidyr function).

Code:

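```r
library(lubridate)  # year(), week()
library(dplyr)      # mutate(), select(), %>%
library(tidyr)      # pivot_wider()

week_stocks <-
  stocks %>%
  # Build a week key that is unique across years, e.g. "2015_37"
  mutate(Week = paste(year(Date), week(Date), sep = "_")) %>%
  # Keep only the columns needed (this drops Date)
  select(StockID, Value, Week) %>%
  # One row per week, one column per stock
  pivot_wider(names_from = StockID, values_from = Value)
```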

After the transformation, use `cor()` to get the correlations between all pairs of stocks. Since you are only interested in the correlations with one specific stock, you can use `select()` to keep just that stock's column and drop the correlations among the other stocks.

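```r
# Correlation matrix of all stocks ([-1] drops the Week column)
cor(week_stocks[-1]) %>%
  as_tibble(rownames = "stockIDs") %>%
  # Keep only the correlations with stock 210449
  select(stockIDs, `210449`)
```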
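To get from that column to the five most correlated stocks asked for in the question, one option is to drop the target stock itself, sort, and keep the first five rows. The following is only a sketch: the toy `week_stocks` table, the stock IDs `1` to `3`, and the `target` variable are made up for illustration, and `use = "pairwise.complete.obs"` is my assumption about how to handle weeks where a stock has no value.

```r
library(dplyr)

# Hypothetical toy data, already in the wide shape produced above:
# one row per week, one column per stock.
week_stocks <- tibble(
  Week     = paste(2015, 37:42, sep = "_"),
  `210449` = c(1, 2, 3, 4, 5, 6),
  `1`      = c(2, 4, 6, 8, 10, 12),  # moves exactly with the target
  `2`      = c(6, 5, 4, 3, 2, 1),    # moves exactly against the target
  `3`      = c(1, 3, 2, 5, 4, 6)     # moves roughly with the target
)

target <- "210449"  # the stock to compare every other stock against

top_correlated <-
  # Assumption: use only the weeks where both stocks of a pair have a value
  cor(week_stocks[-1], use = "pairwise.complete.obs") %>%
  as_tibble(rownames = "stockIDs") %>%
  # Keep the ID column and the target's correlation column
  transmute(stockIDs, correlation = .data[[target]]) %>%
  # A stock always correlates perfectly with itself, so drop that row
  filter(stockIDs != target) %>%
  # Highest correlation first, then keep at most five rows
  arrange(desc(correlation)) %>%
  head(5)

top_correlated
```

This ranks by the signed correlation, so strongly negative stocks end up last. If a strong negative relationship should count as "highly correlated" too, sorting by `desc(abs(correlation))` would be the variant to try.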

Some general remarks:

  1. In the code in your question you use the `attach()` command. It is generally not recommended to use that for data frames, as it can lead to confusion and errors; see this blogpost.
  2. If you are looking to improve your R skills, check out the tidyverse and its packages. It is a set of packages that share a common approach to data science operations and let you solve most data science problems with a small set of concise commands.
  3. When asking a question on StackOverflow it is usually good practice to include your data or at least parts of it. `dput()` provides the code needed for that. By pasting the output of `dput()` into your question, other users can recreate your data in their environment and develop their answer while testing their code against your data. For example, this is the `dput()` output of the dummy dataset I used to write the code in my answer.

Code:

```r
dput(stocks)
```

Output:

```r
structure(list(StockID = c(16139, 210449, 210449, 210449, 210449, 
210449), Date = c("2015-09-11", "2015-09-11", "2015-09-18", "2015-09-25", 
"2015-10-02", "2015-10-09"), Value = c(0.055063, 0.01851903, 
0.01338099, 0.03982749, 0.04798457, 0.02433628)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))
```
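As a rough illustration of that round trip, the sketch below assigns the `dput()` output to `stocks` again and runs the transformation from the top of this answer on it. The explicit `as.Date()` conversion and the `cor_mat` name are my additions for the example, not something the original code requires.

```r
library(lubridate)
library(dplyr)
library(tidyr)

# Recreate the dummy data by assigning the dput() output to a name
stocks <- structure(list(StockID = c(16139, 210449, 210449, 210449, 210449,
210449), Date = c("2015-09-11", "2015-09-11", "2015-09-18", "2015-09-25",
"2015-10-02", "2015-10-09"), Value = c(0.055063, 0.01851903,
0.01338099, 0.03982749, 0.04798457, 0.02433628)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

week_stocks <-
  stocks %>%
  # Dates arrive as text in the dput() output; convert them first (my addition)
  mutate(Date = as.Date(Date),
         Week = paste(year(Date), week(Date), sep = "_")) %>%
  select(StockID, Value, Week) %>%
  pivot_wider(names_from = StockID, values_from = Value)

# Correlation matrix of the two dummy stocks
cor_mat <- cor(week_stocks[-1])
cor_mat
```

With this tiny sample, stock 16139 has a value for only one week, so its correlations come out as `NA`. A meaningful result needs several weeks in which both stocks have a value.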
Till