0

I'm not sure this is the right place to ask this kind of question, but here it goes…

I would like to do statistical analyses with data I have in R. However, that data is scattered in several data.frame (df) objects of varying sizes.

All the df objects have the same structure such as:

        Date PDD.Open PDD.High PDD.Low PDD.Close PDD.Volume PDD.Adjusted
1 2018-07-26    26.50    27.45   25.00     26.70   43213200        26.70
2 2018-07-27    27.54    27.54   23.21     24.60   19923300        24.60
3 2018-07-30    23.31    23.90   21.88     22.50   13967700        22.50
4 2018-07-31    22.20    22.71   19.62     22.59   13709600        22.59
5 2018-08-01    19.38    21.40   18.62     20.31   19339000        20.31
6 2018-08-02    19.27    20.18   19.03     19.66    9268500        19.66

For example, as of today (2021.10.23) MPNGF has 734 observations, NKLA has 850, PDD has 818, and RMO 648.

When I try to correlate, say, PDD$PDD.Open with RMO$RMO.Open, I get an Error in cor(PDD$PDD.Open, RMO$RMO.Open) : incompatible dimensions.

Here below is a hopefully reproducible code example to illustrate the issue:

library(quantmod)

# Load various ticker data
tickers <- c("NKLA", "MPNGF", "PDD", "RMO")
getSymbols.yahoo(tickers, auto.assign = TRUE, env = globalenv(), 
                 from = "2000-01-01")

# Close all Internet connections as a precaution
# https://stackoverflow.com/a/52758758/2950721
closeAllConnections()

# Find xts objects
xtsObjects <- names(which(unlist(eapply(.GlobalEnv, is.xts))))

# Convert xts to data.frame
# https://stackoverflow.com/a/69246047/2950721
for (i in seq_along(xtsObjects)) {
  assign(xtsObjects[i], fortify.zoo(get(xtsObjects[i])))
}

# Change name of 1st column to Date of converted xts objects
# https://stackoverflow.com/a/69292036/2950721
for (i in seq_along(xtsObjects)) {
  tmp <- get(xtsObjects[i])
  colnames(tmp)[colnames(tmp) == "Index"] <- "Date"
  assign(xtsObjects[i], tmp)
}
remove(i, tickers, tmp, xtsObjects)


> cor(PDD$PDD.Open, RMO$RMO.Open)
Error in cor(PDD$PDD.Open, RMO$RMO.Open) : incompatible dimensions

Additional important details:

  • The quantity of df objects increases (i.e., new tickers/symbols are added regularly and their data downloaded).
  • The existing df objects are updated on a regular (sometimes daily) basis (i.e., last values added to each downloaded ticker).

My questions:

  • With the above in mind, how should the data be "handled" (for lack of a better term) so that any type of statistical analyses can be carried out with it (including across more than 2 df objects)?
  • Should the data be merged?
  • Is there a best practice?
  • What code should be used to do the merge (keeping in mind that it needs to be updated)?

Thanks in advance.


Systems used:

  • R version: 4.1.1 (2021-08-10)
  • RStudio version: 1.4.1717
  • OS: macOS Catalina version 10.15.7 and macOS Big Sur version 11.6
pdeli
  • 436
  • 3
  • 13

1 Answers1

0

You may calculate the correlations on matching dates.

dis <- as.Date(intersect(PDD$Date, RMO$Date), origin='1970-01-01')
cat('Nobs used\nPDD: ', length(dis)*100/nrow(PDD), 
    '%, RMO: ', length(dis)*100/nrow(RMO), '%', sep='')
# Nobs used
# PDD: 79.2176%, RMO: 100%
cor(PDD[PDD$Date %in% dis, 'PDD.Open'], RMO[RMO$Date %in% dis, 'RMO.Open'])
# [1] 0.4150031

If you rbind the data frames, you need unique names and might want to add a column that identifies the observations.

L <- mget(c("MPNGF", "NKLA", "PDD", "RMO"))  ## list data.frames
L <- Map(`[<-`, L, 'Name', value=names(L))  ## add 'Name' column
## set unique names
L <- lapply(L, setNames, gsub('.*(?>\\.)', '', names(L[[1]]), perl=TRUE))
DF <- `rownames<-`(do.call(rbind, L), NULL)  ## rbind

head(DF)
#         Date Open High Low Close Volume Adjusted  Name
# 1 2018-11-23  7.0  7.0 7.0   7.0    100      7.0 MPNGF
# 2 2018-11-26  7.0  7.0 7.0   7.0      0      7.0 MPNGF
# 3 2018-11-27  7.0  7.0 7.0   7.0      0      7.0 MPNGF
# 4 2018-11-28  7.0  7.0 7.0   7.0      0      7.0 MPNGF
# 5 2018-11-29  6.5  6.5 6.5   6.5    100      6.5 MPNGF
# 6 2018-11-30  6.5  6.5 6.5   6.5      0      6.5 MPNGF
jay.sf
  • 60,139
  • 8
  • 53
  • 110
  • Thanks for your answer jay.sf. Save for the ```Date``` column, each column is already uniquely identified (you can check the structure example I posted in my question). Do you think that the ```Date``` column should also be uniquely identified? If not, how to ```rbind``` the ```data.frame``` objects with the ```data.frame``` object having the oldest date? What if a new ticker is loaded at a later stage with even older dates, can it take precedence? – pdeli Oct 24 '21 at 12:36