I am using the randomForest library in R via RPy2. I would like to pass back the values calculated using the caret predict method and join them to the original pandas dataframe. See example below.

import pandas as pd
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r = robjects.r
r.library("randomForest")
r.library("caret")

df = pd.DataFrame(data=np.random.rand(100, 10), columns=["a{}".format(i) for i in range(10)])
df["b"] = ['a' if x < 0.5 else 'b' for x in np.random.sample(size=100)]
train = df.loc[df.a0 < .75]
withheld = df.loc[df.a0 >= .75]

rf = r.randomForest(robjects.Formula('b ~ .'), data=train)
pr = r.predict(rf, withheld)
print(pr.rx())

Which returns

 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
 a  a  b  b  b  a  a  a  a  b  a  a  a  a  a  b  a  a  a  a 
Levels: a b

But how can I join this to the withheld dataframe or compare it to the original values?

I have tried this:

import pandas.rpy.common as com
com.convert_robj(pr)

But this returns a dictionary where the keys are strings. I guess there is a workaround: call withheld.reset_index(), convert the dict keys to integers, and then join the two. But there must be a simpler way!

kungphil

2 Answers

There is a pull request that adds conversion of R factors to pandas Categoricals. It has not yet been merged into the pandas master branch. When it is,

import pandas.rpy.common as rcom
rcom.convert_robj(pr)

will convert pr to a Pandas Categorical. Until then, you can use as a workaround:

def convert_factor(obj):
    """
    Taken from jseabold's PR: https://github.com/pydata/pandas/pull/9187
    """
    ordered = r["is.ordered"](obj)[0]
    categories = list(obj.levels)
    codes = np.asarray(obj) - 1  # zero-based indexing
    values = pd.Categorical.from_codes(codes, categories=categories,
                                       ordered=ordered)
    return values
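
To see what the function does without needing R, here is a minimal sketch that uses hand-written stand-ins for the factor. The names r_codes and r_levels are hypothetical; they only mimic what np.asarray(obj) and list(obj.levels) would yield for a factor with levels a and b.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for an R factor: R stores 1-based integer codes
# plus a vector of level labels.
r_codes = np.array([1, 1, 2, 2, 1])   # mimics np.asarray(obj)
r_levels = ["a", "b"]                 # mimics list(obj.levels)

# Shift to the zero-based codes that pandas expects.
values = pd.Categorical.from_codes(r_codes - 1, categories=r_levels,
                                   ordered=False)
print(list(values))  # ['a', 'a', 'b', 'b', 'a']
```

The subtraction matters: passing the 1-based codes unchanged would map every value to the wrong level, or fail for the highest code.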

For example,

import pandas as pd
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r = robjects.r
r.library("randomForest")
r.library("caret")

def convert_factor(obj):
    """
    Taken from jseabold's PR: https://github.com/pydata/pandas/pull/9187
    """
    ordered = r["is.ordered"](obj)[0]
    categories = list(obj.levels)
    codes = np.asarray(obj) - 1  # zero-based indexing
    values = pd.Categorical.from_codes(codes, categories=categories,
                                       ordered=ordered)
    return values


df = pd.DataFrame(data=np.random.rand(100, 10), 
                  columns=["a{}".format(i) for i in range(10)])
df["b"] = ['a' if x < 0.5 else 'b' for x in np.random.sample(size=100)]
train = df.loc[df.a0 < .75]
# .copy() avoids pandas' SettingWithCopyWarning when a column is added below
withheld = df.loc[df.a0 >= .75].copy()

rf = r.randomForest(robjects.Formula('b ~ .'), data=train)
pr = convert_factor(r.predict(rf, withheld))

withheld['pr'] = pr
print(withheld)
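
Once the predictions sit next to the true labels, comparing them is plain pandas. The following is a rough sketch, assuming a small hand-made frame in place of the real withheld data (the values and row labels are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the withheld frame after the 'pr' column is added.
withheld = pd.DataFrame(
    {"b": ["a", "b", "b", "a"],
     "pr": pd.Categorical(["a", "b", "a", "a"], categories=["a", "b"])},
    index=[3, 7, 12, 20],
)

# Compare as plain strings so the object and categorical dtypes do not matter.
predicted = withheld["pr"].astype(str)
correct = withheld["b"] == predicted
accuracy = correct.mean()

# Confusion table: rows are the true labels, columns the predictions.
confusion = pd.crosstab(withheld["b"], predicted,
                        rownames=["actual"], colnames=["predicted"])
print(accuracy)
print(confusion)
```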
unutbu

The R object pr returned by the function predict is a "vector", which you can think of as a Python array.array or a one-dimensional NumPy array.

"Joining" is not necessary, in the sense that the ordering of the elements in pr corresponds to the rows in the table withheld. One only needs to add pr as an additional column to withheld (see Adding new column to existing DataFrame in Python pandas):

withheld['predictions'] = pd.Series(pr,
                                    index=withheld.index)
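
The index argument is what keeps the rows lined up. Here is a small sketch of the difference, using a hypothetical three-row frame and an array of 1-based integer codes in place of the real pr:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: a filtered frame keeps its original row labels,
# and the predictions arrive as a plain array of R's 1-based factor codes.
withheld = pd.DataFrame({"a0": [0.8, 0.9, 0.95]}, index=[4, 17, 42])
pr_codes = np.array([1, 2, 1])

# Without index=, the Series is labelled 0, 1, 2 and nothing lines up.
misaligned = withheld.assign(predictions=pd.Series(pr_codes))

# With index=withheld.index, each prediction lands on its own row.
aligned = withheld.assign(
    predictions=pd.Series(pr_codes, index=withheld.index))

print(misaligned["predictions"].isna().all())  # True
print(aligned["predictions"].tolist())         # [1, 2, 1]
```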

By default this will add a column of integers (because R factors are encoded as integers).

Note: rpy2 version 2.6.0 will include handling of pandas Categorical vectors, making the converter customization described below unnecessary.

Until then, one can customize rpy2's conversion rather simply (see http://rpy.sourceforge.net/rpy2/doc-2.5/html/robjects_convert.html):

from rpy2.robjects import numpy2ri

@robjects.conversion.ri2py.register(robjects.rinterface.SexpVector)
def ri2py_vector(vector):
    # based on
    # https://bitbucket.org/rpy2/rpy2/src/a75413b09852991869332da615fa754923c32039/rpy/robjects/pandas2ri.py?at=default#cl-73

    # special case for factors
    if 'factor' in vector.rclass:
        res = pd.Categorical.from_codes(np.asarray(vector) - 1,
                                        categories=vector.do_slot('levels'),
                                        ordered='ordered' in vector.rclass)
    else:
        # use the numpy converter first
        res = numpy2ri.ri2py(vector)
    # numpy record arrays (converted R data frames) become pandas DataFrames
    if isinstance(res, np.recarray):
        res = pd.DataFrame.from_records(res)
    return res

With this, converting any rpy2 object into a non-rpy2 object returns a pandas Categorical whenever the object is an R factor:

robjects.conversion.ri2py(pr)

You may decide to add the result of this last conversion to your data table.

Note that the conversion to non-rpy2 objects has to be explicit (one has to call the converter). If you are using IPython, there is a way to make this implicit: https://gist.github.com/lgautier/e2e8709776e0e0e93b8d (and the originating thread https://bitbucket.org/rpy2/rpy2/issue/230/rmagic-specific-conversion).

lgautier