
I create a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(x.toarray(), columns = colnames)

Then I convert it to an R data frame:

import pandas.rpy.common as com

rdf = com.convert_to_r_dataframe(df)

Under Windows with this configuration there are no problems:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: AMD64 Family 16 Model 4
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...

But when I execute it on Linux with this configuration:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...

I get this:

Traceback (most recent call last):
  File "app.py", line 232, in <module>
    clf.global_cl(df, df2)
  File "/home/uzer/app/util/clftool.py", line 202, in global_cl
    rdf = com.convert_to_r_dataframe(df)
  File "/home/uzer/app/venv/local/lib/python2.7/site-packages/pandas/rpy/common.py", line 324, in convert_to_r_dataframe
    value = VECTOR_TYPES[value_type](value)
KeyError: <type 'numpy.int64'>

It seems that VECTOR_TYPES does not have <type 'numpy.int64'> as a key. But it does:

VECTOR_TYPES = {np.float64: robj.FloatVector,
            np.float32: robj.FloatVector,
            np.float: robj.FloatVector,
            np.int: robj.IntVector,
            np.int32: robj.IntVector,
            np.int64: robj.IntVector,
            np.object_: robj.StrVector,
            np.str: robj.StrVector,
            np.bool: robj.BoolVector}

So I printed the value type inside convert_to_r_dataframe (in ../pandas/rpy/common.py):

for column in df:
    value = df[column]
    value_type = value.dtype.type
    print("value_type: %s" % value_type)
    if value_type == np.datetime64:
        value = convert_to_r_posixct(value)
    else:
        value = [item if pd.notnull(item) else NA_TYPES[value_type]
                 for item in value]
        print("Is value_type == np.int64: %s" % (value_type is np.int64))
        value = VECTOR_TYPES[value_type](value)
        ...

And this is the result:

value_type: <type 'numpy.int64'>
Is value_type == np.int64: False

How is this possible? Given that the 32-bit Windows version has no problems, could this be a problem with the 64-bit Linux Python version?
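One possible explanation, offered as an assumption and not confirmed in this thread: on some platforms NumPy exposes more than one scalar type class for the same 64-bit integer width (for example one for the C `long` and one for `long long`), so a column's `dtype.type` can print like `numpy.int64` without being the same object as `np.int64`. A dictionary keyed on type objects then misses it. The sketch below uses a stand-in `VECTOR_TYPES` that maps to plain strings instead of rpy2 classes, and a hypothetical helper `lookup_vector_type` that falls back to comparing dtypes when the identity lookup fails. It is not pandas or rpy2 code.

```python
import numpy as np

# Stand-in for pandas.rpy.common.VECTOR_TYPES: strings replace the rpy2
# vector classes so that the sketch runs without rpy2 installed.
VECTOR_TYPES = {
    np.float64: "FloatVector",
    np.float32: "FloatVector",
    np.int32: "IntVector",
    np.int64: "IntVector",
    np.object_: "StrVector",
    np.bool_: "BoolVector",
}


def lookup_vector_type(value_type):
    """Hypothetical helper: exact key first, then dtype equivalence."""
    try:
        return VECTOR_TYPES[value_type]
    except KeyError:
        wanted = np.dtype(value_type)
        for key, vector in VECTOR_TYPES.items():
            # dtype equality compares kind and size, not class identity
            if np.dtype(key) == wanted:
                return vector
        raise  # no equivalent dtype either: re-raise the KeyError


# Show identity vs. dtype equivalence for a few integer scalar types.
for t in (np.int64, np.longlong, np.int_):
    print("%s | is np.int64: %s | same dtype: %s | maps to: %s" % (
        t, t is np.int64, np.dtype(t) == np.dtype(np.int64),
        lookup_vector_type(t)))
```

If the identity column shows False while the dtype column shows True for one of these types on your machine, that would be consistent with the KeyError above. Comparing `np.dtype(value_type)` instead of the type object itself sidesteps the difference.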

EDIT: As suggested by @lgautier, I changed this:

rdf = com.convert_to_r_dataframe(df)

to:

from rpy2.robjects import pandas2ri
rdf = pandas2ri.pandas2ri(df)

And that worked.

NOTE: My DataFrame has column names that contain UTF-8 special characters, decoded to unicode. When the DataFrame constructor in rpy2/robjects/vectors.py is called, this line tries to encode the unicode strings (which contain special characters) as ASCII strings:

kv = [(str(k), conversion.py2ri(obj[k])) for k in obj]

This raises an error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)

To solve this I had to change that line to:

kv = [(k.encode('UTF-8'), conversion.py2ri(obj[k])) for k in obj]

rpy2 should provide a way to change the encoding.
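The failure can be reproduced without rpy2. The sketch below uses a made-up column name, `citt\xe0`, chosen only because the non-ASCII character sits at position 4 as in the error message. It encodes to ASCII explicitly, which is an assumption about what `str(k)` does implicitly for a unicode object under Python 2.

```python
# -*- coding: utf-8 -*-

# Hypothetical column name: u'\xe0' (a-grave) is the character at index 4.
name = u"citt\xe0"

try:
    name.encode("ascii")  # assumed equivalent of Python 2's str(name)
except UnicodeEncodeError as err:
    print("ascii failed: %s" % err)

# Encoding to UTF-8 explicitly succeeds and round-trips.
encoded = name.encode("utf-8")
print(repr(encoded))
print(encoded.decode("utf-8") == name)
```

Encoding the names before they reach the constructor has the same effect as the patched line without editing the installed rpy2 source.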

MrMoog
  • After looking more closely at the code, the use of `str(k)` is also not completely consistent. A quick fix would be to add this as a parameter to the `DataFrame` constructor, but this would not completely solve the problem. Maybe this is just the kind of headache that the massive change to string handling in Python 3 addresses. – lgautier Feb 09 '15 at 20:55

1 Answer


Consider using rpy2's own conversion (which appears to work with int64 on Linux):

# create a test DataFrame
import numpy
import pandas

i2d = numpy.array([[1, 2, 3], [4, 5, 6]], dtype="int64")
colnames = ('a', 'b', 'c')
dataf = pandas.DataFrame(i2d, 
                         columns = colnames)

# rpy2's conversion of pandas objects
from rpy2.robjects import pandas2ri
pandas2ri.activate()

Now pandas DataFrame objects will be converted automatically to rpy2/R DataFrame objects on each call using the embedded R. For example:

from rpy2.robjects.packages import importr
# R's "base" package
base = importr('base')
# call the R function "summary"
print(base.summary(dataf))

One can also call the conversion explicitly:

from rpy2.robjects import conversion
rpy2_dataf = conversion.py2ro(dataf)

edit: (added customization to work around the str(k) issue)

Should anything in the conversion require local customization, this can be achieved relatively easily. One way to change how the R DataFrame is built is:

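# DataFrame is a class, not a module, so it has to be imported with "from ... import"
from pandas import DataFrame as PandasDataFrame
from rpy2.robjects.vectors import DataFrame as RDataFrame
from rpy2.robjects import conversion
from rpy2 import rinterface

@conversion.py2ro.register(PandasDataFrame)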
def py2ro_pandasdataframe(obj):
    ri_dataf = conversion.py2ri(obj)
    # cast down to an R list (goes through a different code path
    # in the DataFrame constructor, avoiding `str(k)`) 
    ri_list = rinterface.SexpVector(ri_dataf)
    return RDataFrame(ri_list)

From now on, the conversion function above will be used when a pandas DataFrame is present:

rpy2_dataf = conversion.py2ro(dataf)
lgautier
  • I need to call the conversion explicitly, so I tried `conversion.py2ro(df)` and, alternatively, `pandas2ri.pandas2ri(df)`, but both result in: `UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)`. This is probably due to column/row names that need to be encoded as UTF-8 and not ASCII. Is there a way to force UTF-8 encoding? – MrMoog Feb 08 '15 at 10:34
  • @MrMoog - It is hard to reproduce without a full example. There was a recent report of encoding issues that appear Windows-specific ( http://stackoverflow.com/questions/28247851/rpy2-korean-characters-are-not-working-on-rpy2 ). It would be interesting to see if this is a related issue. – lgautier Feb 08 '15 at 14:50
  • I forgot to mention a shortcoming in rpy2. In the `DataFrame` constructor (in `rpy2/robjects/vectors.py`), strings are automatically converted to ASCII. In my DataFrame the column names are unicode strings with special characters (decoded from UTF-8). I had to change `kv = [(str(k), conversion.py2ri(obj[k])) for k in obj]` to `kv = [(k.encode('UTF-8'), conversion.py2ri(obj[k])) for k in obj]` – MrMoog Feb 09 '15 at 10:56
  • Strings are not converted to ASCII. As shown in the code you quote, the string representation (method `str`) as defined by the running Python is used. It happens that in Python 2 strings are bytes. There may be a way to make a Python 2-specific patch, but I'd rather recommend changing the default encoding (see http://stackoverflow.com/questions/2276200/changing-default-encoding-of-python) or moving to Python 3. – lgautier Feb 09 '15 at 13:30