5

I use R off and on as a "backend" to Python, so I occasionally need to import data frames from R into Python, but I can't figure out how to import an R data.frame as a Pandas DataFrame.

For example, if I create a data frame in R

rdf = data.frame(a=c(2, 3, 5), b=c("aa", "bb", "cc"), c=c(TRUE, FALSE, TRUE))

and then pull it into Python using rmagic with

%Rpull -d rdf

I get

array([(2.0, 1, 1), (3.0, 2, 0), (5.0, 3, 1)], 
      dtype=[('a', '<f8'), ('b', '<i4'), ('c', '<i4')])

I don't know what this is, and it's certainly not the

pd.DataFrame({'a': [2, 3, 5], 'b': ['aa', 'bb', 'cc'], 'c': [True, False, True]})

that I would expect.

The only thing that comes close to working for me is to transfer the data frame through a file, by writing in R

write.csv(data.frame(a=c(2, 3, 5), b=c("aa", "bb", "cc"), c=c(TRUE, FALSE, TRUE)), file="TEST.csv")

and then reading in Python

pd.read_csv("TEST.csv")

though even this approach produces an additional column: "Unnamed: 0".

What is the idiom for importing an R dataframe into Python as a Pandas dataframe?

Artjom B.
orome
  • possible duplicate of [Pandas - how to convert r dataframe back to pandas?](http://stackoverflow.com/questions/20630121/pandas-how-to-convert-r-dataframe-back-to-pandas) – joris Mar 29 '14 at 19:32
  • See also this comment from @lgautier: http://stackoverflow.com/questions/15209636/convert-to-r-dataframe-module-object-has-no-attribute#comment21457740_15209636 – joris Mar 29 '14 at 19:33
  • @joris: Not a duplicate. Look closely at the question. This is about dataframes created **in R**. – orome Mar 29 '14 at 19:41

2 Answers

6

First: array([(2.0, 1, 1), (3.0, 2, 0), (5.0, 3, 1)], dtype=[('a', '<f8'), ('b', '<i4'), ('c', '<i4')]) is a numpy structured array (see http://docs.scipy.org/doc/numpy/user/basics.rec.html). You can easily convert it to a pandas DataFrame with pd.DataFrame:

In [65]:

import numpy as np
import pandas as pd
# Build a DataFrame directly from the structured array that %Rpull returned
print(pd.DataFrame(np.array([(2.0, 1, 1), (3.0, 2, 0), (5.0, 3, 1)], dtype=[('a', '<f8'), ('b', '<i4'), ('c', '<i4')])))
   a  b  c
0  2  1  1
1  3  2  0
2  5  3  1

The b column is coded as integers (as if factor()'ed in R), and the c column was converted from boolean to int. The a column is float ('<f8'); I found that unexpected at first, but R stores c(2, 3, 5) as doubles unless the literals are written as 2L, 3L, 5L.
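If you want to stay with the structured array, the coded columns can be mapped back by hand. The following is only a minimal sketch: the `levels` list is an assumption taken from the R data frame in the question (what `levels(rdf$b)` would return there), and the names `pulled` and `df` are made up for illustration. R factor codes start at 1, hence the `code - 1`.

```python
import numpy as np
import pandas as pd

# The structured array exactly as shown in the question's %Rpull output.
pulled = np.array([(2.0, 1, 1), (3.0, 2, 0), (5.0, 3, 1)],
                  dtype=[('a', '<f8'), ('b', '<i4'), ('c', '<i4')])

# Assumption: the factor levels of rdf$b, in R's order.
levels = ['aa', 'bb', 'cc']

df = pd.DataFrame(pulled)
# R factor codes are 1-based, Python lists are 0-based.
df['b'] = [levels[code - 1] for code in df['b']]
# Logical values arrived as 0/1 integers; cast them back to bool.
df['c'] = df['c'].astype(bool)

print(df)
```

This only works when you know the levels on the Python side, since the structured array itself does not carry them.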

Second, I think pandas.rpy.common is the most convenient way of fetching data from R: http://pandas.pydata.org/pandas-docs/stable/r_interface.html (the documentation is probably too brief, so here is another example):

In [71]:

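import pandas as pd
# Note: pandas.rpy exists in the pandas versions of this era; later releases removed it in favor of rpy2
import pandas.rpy.common as com
DF = pd.DataFrame({'val': [1, 1, 1, 2, 2, 3, 3]})
r_DF = com.convert_to_r_dataframe(DF)  # pandas -> R data.frame
print(pd.DataFrame(com.convert_robj(r_DF)))  # R data.frame -> pandas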
   val
0    1
1    1
2    1
3    2
4    2
5    3
6    3

Finally, the Unnamed: 0 column is the index column (the row names that write.csv adds). You can avoid it by passing index_col=0 to pd.read_csv().
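To illustrate the index_col=0 tip without needing R, here is a small sketch that reads the CSV from a string instead of a file. The `csv_text` content is an assumption: it is what write.csv would be expected to write for the example data frame, with row names in an unnamed first column.

```python
import io
import pandas as pd

# Assumption: the text R's write.csv() writes for the example data frame.
csv_text = '''"","a","b","c"
"1",2,"aa",TRUE
"2",3,"bb",FALSE
"3",5,"cc",TRUE
'''

# Default read: the row names show up as an extra "Unnamed: 0" column.
with_extra = pd.read_csv(io.StringIO(csv_text))
print(list(with_extra.columns))

# index_col=0 uses the row names as the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean)
```

Note that the strings and the booleans survive this round trip, which is what the structured array from %Rpull loses.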

CT Zhu
  • 52,648
  • 17
  • 120
  • 133
  • The second approach does nothing different from the first — i.e., wrapping in `pd.DataFrame(com.convert_robj(rdf))` is no different from `pd.DataFrame(rdf)`. The first approach changes all the values in bizarre ways. It looks like the file export/import approach is the only way that works? – orome Mar 29 '14 at 19:21
  • And: Thanks for the `index_col=0` tip. That definitely makes import/export the preferred approach, unless I'm missing something. – orome Mar 29 '14 at 19:23
  • You are right, the only way I found to preserve the `string` data type is `%R z = c('a',1,'c')` followed by `%Rpull z`. Putting it in a `data.frame` always results in it being converted to `int32/64`. A side note: on my machine the 2nd approach differs slightly from the 1st in that the resulting `DataFrame` has `int64` for all its columns, rather than a mixed bag of `dtypes`. – CT Zhu Mar 29 '14 at 19:55
  • For the last one, on the `R` side, you can also do `write.csv(......, row.names=FALSE)` instead. – CT Zhu Mar 29 '14 at 20:01
2

What about this (see pandas 0.13.1 documentation):

%load_ext rmagic
%R rdf = data.frame(a=c(2, 3, 5), b=c("aa", "bb", "cc"), c=c(TRUE, FALSE, TRUE))

import pandas.rpy.common as com

print(com.load_data('rdf'))
   a   b      c
1  2  aa   True
2  3  bb  False
3  5  cc   True
masat
  • 350
  • 1
  • 10
  • rdf is your R data frame: rdf = data.frame(a=c(2, 3, 5), b=c("aa", "bb", "cc"), c=c(TRUE, FALSE, TRUE)) – masat Mar 30 '14 at 13:25