I create a pandas DataFrame:
import pandas as pd
df = pd.DataFrame(x.toarray(), columns = colnames)
Then I convert it to an R data frame:
import pandas.rpy.common as com
rdf = com.convert_to_r_dataframe(df)
Under Windows with this configuration there are no problems:
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.7.final.0
python-bits: 32
OS: Windows
OS-release: 7
machine: AMD64
processor: AMD64 Family 16 Model 4
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...
But when I execute it on Linux with this configuration:
>>> pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.2.0-29-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.14.1
numpy: 1.8.2
rpy2: 2.4.4
...
I get this:
Traceback (most recent call last):
File "app.py", line 232, in <module>
clf.global_cl(df, df2)
File "/home/uzer/app/util/clftool.py", line 202, in global_cl
rdf = com.convert_to_r_dataframe(df)
File "/home/uzer/app/venv/local/lib/python2.7/site-packages/pandas/rpy/common.py", line 324, in convert_to_r_dataframe
value = VECTOR_TYPES[value_type](value)
KeyError: <type 'numpy.int64'>
The error suggests that VECTOR_TYPES has no <type 'numpy.int64'> key, but it does:
VECTOR_TYPES = {np.float64: robj.FloatVector,
np.float32: robj.FloatVector,
np.float: robj.FloatVector,
np.int: robj.IntVector,
np.int32: robj.IntVector,
np.int64: robj.IntVector,
np.object_: robj.StrVector,
np.str: robj.StrVector,
np.bool: robj.BoolVector}
So I printed the value type inside convert_to_r_dataframe (in ../pandas/rpy/common.py):
for column in df:
    value = df[column]
    value_type = value.dtype.type
    print("value_type: %s" % value_type)
    if value_type == np.datetime64:
        value = convert_to_r_posixct(value)
    else:
        value = [item if pd.notnull(item) else NA_TYPES[value_type]
                 for item in value]
        print("Is value_type == np.int64: %s" % (value_type is np.int64))
        value = VECTOR_TYPES[value_type](value)
...
And that's the result:
value_type: <type 'numpy.int64'>
Is value_type == np.int64: False
How is this possible? Given that the 32-bit Windows version has no problems, could this be a problem with the 64-bit Linux Python version?
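As a plain-Python illustration of how such a result can arise (no NumPy involved): two distinct type objects can print identically, and a dict keyed by one of them will not find the other. Whether this is what actually happens inside NumPy on this Linux build is an assumption, not something confirmed here; make_scalar_type and vector_types are hypothetical names used only for this sketch.

```python
def make_scalar_type():
    # Each call builds a brand-new class object, even though its name is the same.
    class int64(int):
        pass
    return int64


first = make_scalar_type()
second = make_scalar_type()

# Hypothetical stand-in for VECTOR_TYPES, keyed by the first type object only.
vector_types = {first: "IntVector"}

same_repr = repr(first) == repr(second)   # both print the same way
same_object = first is second             # but they are different objects

try:
    vector_types[second]
    lookup_failed = False
except KeyError:
    # Same symptom as in the traceback: the key "looks" present but is not found.
    lookup_failed = True

# Looking up by name instead of by object identity finds the entry again.
by_name = {t.__name__: v for t, v in vector_types.items()}
found = by_name[second.__name__]
```

If something like this is the cause, a lookup that does not depend on the identity of the type object (for example by name, as above) would avoid the KeyError.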
EDIT: As suggested by @lgautier, I changed this:
rdf = com.convert_to_r_dataframe(df)
to:
from rpy2.robjects import pandas2ri
rdf = pandas2ri.pandas2ri(df)
And that worked.
NOTE: My DataFrame has column names that contain UTF-8 special characters, decoded to unicode. When the DataFrame constructor (in rpy2/robjects/vectors.py) is called, this line tries to encode the unicode strings (which contain special characters) as ASCII:
kv = [(str(k), conversion.py2ri(obj[k])) for k in obj]
This raises an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 4: ordinal not in range(128)
To solve this I had to change that line to:
kv = [(k.encode('UTF-8'), conversion.py2ri(obj[k])) for k in obj]
rpy2 should provide a way to change the encoding.
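A minimal sketch of what a configurable encoding could look like, kept independent of rpy2. It is written for Python 3, where str is already unicode, whereas the code above runs on Python 2.7, where str(k) implicitly encodes with ASCII. The helper encode_names and the sample column name are hypothetical and are not part of rpy2.

```python
def encode_names(names, encoding="utf-8"):
    """Encode each text column name to bytes using the given encoding."""
    return [name.encode(encoding) for name in names]


# Hypothetical column name; U+00E0 sits at position 4.
names = ["citt\u00e0"]

# UTF-8 can represent the character, so this succeeds.
utf8_names = encode_names(names)

# ASCII cannot, which reproduces the kind of error shown above.
try:
    encode_names(names, encoding="ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
```

Passing the encoding as a parameter keeps the default explicit and lets the caller override it without patching the library source.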