I'm not sure this is the right place to ask this kind of question, but here it goes…
I would like to do statistical analyses with data I have in R. However, that data is scattered in several data.frame
(df
) objects of varying sizes.
All the df
objects have the same structure such as:
Date PDD.Open PDD.High PDD.Low PDD.Close PDD.Volume PDD.Adjusted
1 2018-07-26 26.50 27.45 25.00 26.70 43213200 26.70
2 2018-07-27 27.54 27.54 23.21 24.60 19923300 24.60
3 2018-07-30 23.31 23.90 21.88 22.50 13967700 22.50
4 2018-07-31 22.20 22.71 19.62 22.59 13709600 22.59
5 2018-08-01 19.38 21.40 18.62 20.31 19339000 20.31
6 2018-08-02 19.27 20.18 19.03 19.66 9268500 19.66
For example, as of today (2021.10.23) MPNGF
has 734 observations, NKLA
has 850, PDD
has 818, and RMO
648.
When I try to correlate, say, PDD$PDD.Open
with RMO$RMO.Open
, I get an Error in cor(PDD$PDD.Open, RMO$RMO.Open) : incompatible dimensions
.
Here below is a hopefully reproducible code example to illustrate the issue:
library(quantmod)
# Load various ticker data
tickers <- c("NKLA", "MPNGF", "PDD", "RMO")
getSymbols.yahoo(tickers, auto.assign = TRUE, env = globalenv(),
from = "2000-01-01")
# Close all Internet connections as a precaution
# https://stackoverflow.com/a/52758758/2950721
closeAllConnections()
# Find xts objects
xtsObjects <- names(which(unlist(eapply(.GlobalEnv, is.xts))))
# Convert xts to data.frame
# https://stackoverflow.com/a/69246047/2950721
for (i in seq_along(xtsObjects)) {
assign(xtsObjects[i], fortify.zoo(get(xtsObjects[i])))
}
# Change name of 1st column to Date of converted xts objects
# https://stackoverflow.com/a/69292036/2950721
for (i in seq_along(xtsObjects)) {
tmp <- get(xtsObjects[i])
colnames(tmp)[colnames(tmp) == "Index"] <- "Date"
assign(xtsObjects[i], tmp)
}
remove(i, tickers, tmp, xtsObjects)
> cor(PDD$PDD.Open, RMO$RMO.Open)
Error in cor(PDD$PDD.Open, RMO$RMO.Open) : incompatible dimensions
Additional important details:
- The quantity of
df
objects increases (i.e., new tickers/symbols are added regularly and their data downloaded). - The existing
df
objects are updated on a regular (sometimes daily) basis (i.e., last values added to each downloaded ticker).
My questions:
- With the above in mind, how should the data be "handled" (for lack of a better term) so that any type of statistical analyses can be carried out with it (including across more than 2
df
objects)? - Should the data be merged?
- Is there a best practice?
- What code should be used to do the merge (keeping in mind that it needs to be updated)?
Thanks in advance.
Systems used:
- R version: 4.1.1 (2021-08-10)
- RStudio version: 1.4.1717
- OS: macOS Catalina version 10.15.7 and macOS Big Sur version 11.6