I have some .rda files that I need to access with Python. My code looks like this:

import rpy2.robjects as robjects
from rpy2.robjects import r, pandas2ri

pandas2ri.activate()
df = robjects.r.load("datafile.rda")
df2 = pandas2ri.ri2py_dataframe(df)

where df2 is a pandas dataframe. However, it only contains the header of the .rda file! I have searched back and forth. None of the solutions proposed seem to be working.

Does anyone have an idea how to efficiently convert an .rda dataframe to a pandas dataframe?

Parfait
Matina G
  • Try saving from R an .rds ([single object](https://stackoverflow.com/a/21370351/1422451)) file. – Parfait Dec 15 '17 at 16:22
  • Thank you for this proposal. However, I have no control over the generation of the .rda files, and converting them to .rds before loading them with Python would be extremely inefficient. Any other suggestions? – Matina G Dec 18 '17 at 13:57
  • Actually not really, simply load the .rda files in an R environment and run the `eapply` or `mget` to save every global environ object into individual rds files. – Parfait Dec 18 '17 at 15:06

3 Answers


Thank you for your useful question. I tried the two ways proposed above to handle my problem. For feather, I faced this issue:

pyarrow.lib.ArrowInvalid: Not a Feather V1 or Arrow IPC file

For rpy2, as mentioned by @0range: "pandas2ri.ri2py_dataframe does not seem to exist any longer in rpy2 version 3.0.3" or later.

I searched for another workaround and found pyreadr, which worked for me and may help others facing the same problems: https://github.com/ofajardo/pyreadr

Usage: https://gist.github.com/LeiG/8094753a6cc7907c716f#gistcomment-2795790

$ pip install pyreadr

import pyreadr

result = pyreadr.read_r('/path/to/file.RData') # also works for Rds, rda

# done! let's see what we got
# result is a dictionary: the keys are the object names and the values are
# the corresponding Python objects
print(result.keys()) # let's check what objects we got
df1 = result["df1"] # extract the pandas data frame for object df1
Hoa Nguyen
  • I tried this on a `.rda` file and got this error: `pyreadr.custom_errors.LibrdataError: The file is compressed using an unsupported compression scheme` -- any workarounds? – Marc Maxmeister Jun 12 '20 at 02:20
  • Hi @MarcMaxmeister, is it possible to share the file? Actually, that package still has some limitations: https://github.com/ofajardo/pyreadr. I converted `rda` files from this repository: https://github.com/clauswilke/dviz.supp/tree/master/data and it worked quite well (41 out of 48 were successfully converted). My converted files were saved in `tsv` format here: https://github.com/nguyenhoa93/data-visualization-practice/tree/master/data/resources. – Hoa Nguyen Jun 12 '20 at 22:36
  • The .rda file is too big to share. I think gigabytes. It was a genomics database used by a defunct R library. – Marc Maxmeister Jun 16 '20 at 03:29
  • I figured out a fix - I had to install R, then save to feather, and then load with `read_feather` in Python pandas. – Marc Maxmeister Jun 16 '20 at 03:30
  • Note: If interested in using rpy2 with Arrow, there is this - https://github.com/rpy2/rpy2-arrow – lgautier Jan 23 '21 at 20:50
  • The solution still works even though the pyreadr package is a little dated. – Ashok K Harnal Mar 02 '23 at 05:19
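
Regarding the "unsupported compression scheme" error in the comments above: R's save() can compress a file with gzip, bzip2, or xz, so one possible cause is that the file uses a scheme the reader was not built to handle. The following is a minimal sketch, using only the Python standard library, that checks which scheme a file uses by looking at its leading magic bytes. The helper name detect_compression is hypothetical and not part of pyreadr or rpy2.

```python
# Leading "magic" bytes of the compression formats that R's save() can produce.
MAGIC_BYTES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def detect_compression(path):
    """Return 'gzip', 'bzip2', 'xz', or 'none' based on the first bytes of the file.

    Hypothetical helper for illustration; it only inspects the header and
    does not validate that the file is a well-formed R data file.
    """
    with open(path, "rb") as fh:
        head = fh.read(6)  # the longest signature above is 6 bytes
    for magic, name in MAGIC_BYTES.items():
        if head.startswith(magic):
            return name
    return "none"


# Example call (the file name is a placeholder):
# print(detect_compression("datafile.rda"))
```

If the file turns out to use a scheme the reader cannot handle, re-saving it from R (as in the feather workaround above) is one way around the problem.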

As mentioned, consider splitting the .rda file into its individual objects using R's mget or eapply, then building a Python dictionary of data frames from them.

RPy2

import os
import pandas as pd

import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri    
from rpy2.robjects.packages import importr

pandas2ri.activate()

base = importr('base')
base.load("datafile.rda")    
rdf_List = base.mget(base.ls())

# Iterate through the list of R data frames
pydf_dict = {}

for i, f in enumerate(base.names(rdf_List)):
    # Note: ri2py_dataframe is the rpy2 2.x API; it was removed in rpy2 3.x
    pydf_dict[f] = pandas2ri.ri2py_dataframe(rdf_List[i])

for k,v in pydf_dict.items():
    print(v.head())
Parfait
  • Why do you need to write out as rds and load back in? I am new to rpy2 but in your "python combined" code you could seemingly run it as far as the line `dfList = base.mget(base.ls())`. Then use a `for` loop over the elements of `base.names(dfList)` to populate `df_dict` with the command `df_dict[i] = pandas2ri.ri2py_dataframe(robjects.globalenv[i])`. At least, that seemed to work for me... – Nick May 24 '18 at 16:00
  • You are in fact correct, @Nick. Given the five month old question, answer can be streamlined a bit without saving .rds's to disk. I think I got caught up in the weeds and did not see whole picture. Hindsight is always 20-20 right? – Parfait May 24 '18 at 16:15
  • `pandas2ri.ri2py_dataframe` does not seem to exist any longer in rpy2 version 3.0.3. – 0range Jun 27 '19 at 19:36
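
Since `pandas2ri.ri2py_dataframe` is gone in rpy2 3.x, here is a rough sketch of how the same idea might look with the converter-based API. It assumes rpy2 3.x, a working R installation, and that environment lookups go through the active converter; the function name load_rda_as_dataframes is hypothetical, and the sketch has not been checked against every rpy2 release. It also shows why the code in the question only produced names: R's load() returns the names of the restored objects, not the objects themselves.

```python
def load_rda_as_dataframes(path):
    """Return {object_name: pandas.DataFrame} for the data frames in an .rda file.

    Sketch only: assumes rpy2 3.x and R are installed.
    """
    # Imported inside the function so that R and rpy2 are only required
    # when the function is actually called.
    import pandas as pd
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    from rpy2.robjects.conversion import localconverter

    # load() restores the objects into R's global environment and returns
    # a character vector holding their *names*.
    names = list(ro.r["load"](path))

    frames = {}
    # Inside this block, R objects fetched from the environment should be
    # converted with the pandas rules (assumption: rpy2 3.x behaviour).
    with localconverter(ro.default_converter + pandas2ri.converter):
        for name in names:
            obj = ro.globalenv[name]
            # Keep only objects that actually became pandas data frames.
            if isinstance(obj, pd.DataFrame):
                frames[name] = obj
    return frames
```

Usage would be along the lines of `dfs = load_rda_as_dataframes("datafile.rda")`, followed by `dfs.keys()` to see which objects were found.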

You could try the feather library, a language-agnostic data frame file format that can be written and read from both R and Python.

# Install feather
devtools::install_github("wesm/feather/R")

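library(feather)
path <- "your_file_path"
# datafile is the data frame to export; it must already exist in the
# R session (for example, restored with load())
write_feather(datafile, path)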

Then install the Python package:

$ pip install feather-format

And load your data file:

import feather
path = 'your_file_path'
datafile = feather.read_dataframe(path)
dshkol