3

I have a folder with more than 1000 rda time series files of stock data. Below is the sample code using which I saved my time series(xts) files in rda. I used rda/rdata instead of csv because saving and loading of files are fast and data compression is also very good in rda compared to csv.

library(quantmod)
AAPL <- getSymbols("AAPL",auto.assign=FALSE)
save(AAPL,file="/home/user/folder/AAPL.rda")

           AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
2007-01-03     86.29     86.58    81.90      83.80   309579900      10.96015
2007-01-04     84.05     85.95    83.82      85.66   211815100      11.20341
2007-01-05     85.77     86.20    84.40      85.05   208685400      11.12363
2007-01-08     85.96     86.53    85.28      85.47   199276700      11.17857
2007-01-09     86.45     92.98    85.15      92.57   837324600      12.10717
2007-01-10     94.75     97.80    93.45      97.00   738220000      12.68657

I use these files for many of my data analysis experiments in R. But now I am slowly migrating to python(with pandas) because its a general purpose language. Instead of downloading all stock data once more, is there a way to convert my current rda xts files to python pandas native files(h5 or pickle which ever is the best format). How can I do it?

EDIT

This is what I did in python

import rpy2.robjects as robjects
import pandas.rpy.common as com
import pandas as pd

robj=robjects.r['load']("AAPL.rda")


for sets in robj:
    myRData = com.load_data(sets)
    # convert to DataFrame
    if not isinstance(myRData, pd.DataFrame):
        myRData = pd.DataFrame(myRData)

print(myRData)

The output is

     AAPL.Open  AAPL.High   AAPL.Low  AAPL.Close  AAPL.Volume  AAPL.Adjusted
1.0  86.289999  86.579999  81.899999   83.800002  309579900.0      10.960147
2.0  84.050001  85.949998  83.820003   85.659998  211815100.0      11.203415
3.0  85.770000  86.199997  84.400002   85.049997  208685400.0      11.123633
4.0  85.959998  86.529998  85.280003   85.470000  199276700.0      11.178565
5.0  86.450003  92.979999  85.150000   92.570003  837324600.0      12.107169

it converted it into non-time series dataset in python. How should I convert it to time series?

EDIT 2:

After many searching and tinkering I have come this far. I tried to convert the UTC variable in my rda file to local time

import rpy2.robjects as robjects
import pandas.rpy.common as com
import pandas as pd
import numpy as np

robj=robjects.r['load']("AAPL.rda")

myRData=None
for sets in robj:
   myRData = com.load_data(sets)
   # convert to DataFrame
   if not isinstance(myRData, pd.DataFrame):
       myRData = pd.DataFrame(myRData)

myRData.head(10)      
ts=np.array(robjects.r('attr(AAPL,"index")')).astype(int)

#changing index
myRData.index=pd.to_datetime(ts, utc=True, format='%Y-%m-%d')

myRData.tail(10)

Now the problem is the converted local time index is not properly formatted. The tail part should contain time series of recent dates instead its stuck in 1970.

                                     AAPL.Close  AAPL.Volume  AAPL.Adjusted  
1970-01-01 00:00:01.476144+00:00     116.300003   64041000.0     116.300003  
1970-01-01 00:00:01.476230400+00:00  117.339996   37586800.0     117.339996  
1970-01-01 00:00:01.476316800+00:00  116.980003   35192400.0     116.980003  
1970-01-01 00:00:01.476403200+00:00  117.629997   35652200.0     117.629997  
1970-01-01 00:00:01.476662400+00:00  117.550003   23624900.0     117.550003  
1970-01-01 00:00:01.476748800+00:00  117.470001   24553500.0     117.470001  
1970-01-01 00:00:01.476835200+00:00  117.120003   20034600.0     117.120003  
1970-01-01 00:00:01.476921600+00:00  117.059998   24125800.0     117.059998  
1970-01-01 00:00:01.477008+00:00     116.599998   23192700.0     116.599998  
1970-01-01 00:00:01.477267200+00:00  117.650002   23311700.0     117.650002  
Eka
  • 14,170
  • 38
  • 128
  • 212

1 Answers1

2

After hours of editing and searching. I solved my problem this the resultant code. Any suggestion welcomed

import rpy2.robjects as robjects
import pandas.rpy.common as com
import pandas as pd
import numpy as np
from datetime import datetime

#loading external rda file
robj=robjects.r['load']("AAPL.rda")

myRData=None
for sets in robj:
   myRData = com.load_data(sets)
   # convert to DataFrame
   if not isinstance(myRData, pd.DataFrame):
       myRData = pd.DataFrame(myRData)

myRData.tail(10)    

#fetching UTC data from rda file  
ts=np.array(robjects.r('attr(AAPL,"index")')).astype(int)

#converting UTC to local time
d= np.array([])
for t in ts:
    s=datetime.utcfromtimestamp(t)
    d=np.append(s,d)

#sorting datetime 
d=np.sort(d, axis=0)

#changing index
myRData.index=pd.to_datetime(d)

myRData.tail(10)

Result

             AAPL.Open   AAPL.High    AAPL.Low  AAPL.Close  AAPL.Volume  \
2016-10-11  117.699997  118.690002  116.199997  116.300003   64041000.0   
2016-10-12  117.349998  117.980003  116.750000  117.339996   37586800.0   
2016-10-13  116.790001  117.440002  115.720001  116.980003   35192400.0   
2016-10-14  117.879997  118.169998  117.129997  117.629997   35652200.0   
2016-10-17  117.330002  117.839996  116.779999  117.550003   23624900.0   
2016-10-18  118.180000  118.209999  117.449997  117.470001   24553500.0   
2016-10-19  117.250000  117.760002  113.800003  117.120003   20034600.0   
2016-10-20  116.860001  117.379997  116.330002  117.059998   24125800.0   
2016-10-21  116.809998  116.910004  116.279999  116.599998   23192700.0   
2016-10-24  117.099998  117.739998  117.000000  117.650002   23311700.0  
Eka
  • 14,170
  • 38
  • 128
  • 212