I am using R's MatchIt package, but I am calling it from Python via the rpy2 package.

On the R side, MatchIt gives me a complex result object that includes the raw data and some additional statistical information. One of its components is a matrix that I want to convert into a data frame, which I can do in R like this:

# R Code
m.out <- matchit(....)
m.sum <- summary(m.out)

# The following two lines should somehow be "translated" into
# Python's rpy2
balance <- m.sum$sum.matched
balance <- as.data.frame(balance)

My problem is that I don't know how to implement the last two lines with Python's rpy2 package. I am able to get m.out and m.sum with rpy2.

Please see this MWE:

#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset

if __name__ == '__main__':

    # import
    robjects.packages.importr('MatchIt')

    # data
    p_df = pydataset.data('respiratory')
    p_df.treat = p_df.treat.replace({'P': 0, 'A': 1})

    # Convert Panda data into R data
    with robjects.conversion.localconverter(
        robjects.default_converter + pandas2ri.converter):
        r_df = robjects.conversion.py2rpy(p_df)

    # Call R's matchit with R data object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + sex'),
        data=r_df,
        method='nearest',
        distance='glm')

    # matched data
    match_data = robjects.r['match.data'](match_out)

    # Convert R data into Pandas data
    with robjects.conversion.localconverter(
        robjects.default_converter + pandas2ri.converter):
        match_data = robjects.conversion.rpy2py(match_data)

    # summary object
    match_sum = robjects.r['summary'](match_out)

    # x = robjects.r('''
    # balance <- match_sum$sum.matched
    # balance <- as.data.frame(balance)
    #
    # balance
    # ''')

When inspecting the Python object match_sum, I can't find anything like sum.matched in it. So I have to somehow "translate" match_sum$sum.matched with rpy2, but I don't know how.

An alternative solution would be to run everything as R code with robjects.r(''' # r code ...'''). But in that case I don't know how to bring a Pandas data frame into that code.

EDIT: Be aware that the MWE presented here uses an outdated approach for converting R objects into Python objects and vice versa. Please see the answer below for a better one.

buhtz

1 Answer

Ah, it is always the same phenomenon: while formulating the question, the answer jumps right into your face.

My (maybe not the best) solution is:

  • Use real R code and run it with rpy2.robjects.r().
  • That R code needs to define an R function() so that it can receive a data frame from the outside (the caller).

Besides that, and based on another answer, I also changed how the code converts R data frames into Python data frames.

#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset

if __name__ == '__main__':
    # For converting objects from/into Pandas <-> R
    # Credits: https://stackoverflow.com/a/20808449/4865723
    pandas2ri.activate()

    # import
    robjects.packages.importr('MatchIt')

    # data
    df = pydataset.data('respiratory')
    df.treat = df.treat.replace({'P': 0, 'A': 1})

    # match object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + sex'),
        data=df,
        method='nearest',
        distance='glm')

    # matched data
    match_data = robjects.r['match.data'](match_out)
    match_data = robjects.conversion.rpy2py(match_data)

    # SOLUTION STARTS HERE:
    get_balance_dataframe = robjects.r('''f <- function(match_out) {
        as.data.frame(summary(match_out)$sum.matched)
    }
    ''')
    balance = get_balance_dataframe(match_out)
    balance = robjects.conversion.rpy2py(balance)

    print(type(balance))
    print(balance)

Here is the output.

<class 'pandas.core.frame.DataFrame'>
          Means Treated  Means Control  Std. Mean Diff.  Var. Ratio  eCDF Mean  eCDF Max  Std. Pair Dist.
distance       0.514630       0.472067         0.471744    0.512239   0.077104  0.203704         0.507222
age           32.888889      34.129630        -0.089355    1.071246   0.063738  0.203704         0.721511
sexF           0.111111       0.259259        -0.471405         NaN   0.148148  0.148148         0.471405
sexM           0.888889       0.740741         0.471405         NaN   0.148148  0.148148         0.471405
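
The two R lines from the question (m.sum$sum.matched and as.data.frame) might also be expressed on the Python side, without an R wrapper function. The following is only a sketch and not verified against MatchIt: it assumes that summary() comes back as an rpy2 list-like object with an rx2() method (rpy2's counterpart of R's `[[`). The helper name balance_from_summary is made up for illustration, and robjects.r is passed in as a parameter only so that the sketch itself does not require an R installation at import time.

```python
def balance_from_summary(match_sum, r):
    """Python-side counterpart of `as.data.frame(m.sum$sum.matched)`.

    match_sum: the object returned by robjects.r['summary'](match_out);
               assumed to be an rpy2 list vector that offers rx2()
    r:         rpy2's R entry point (robjects.r), passed in by the caller
    """
    # rx2(name) extracts a single list element by name, like R's `[[name]]`
    # (and therefore like `$name` when the name matches exactly).
    sum_matched = match_sum.rx2('sum.matched')

    # Look up R's as.data.frame and apply it to the extracted matrix.
    return r['as.data.frame'](sum_matched)
```

In the script above this would be called as balance = balance_from_summary(robjects.r['summary'](match_out), robjects.r), followed by the same robjects.conversion.rpy2py(balance) step to get a pandas data frame.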

EDIT: Take care that there are no umlauts or other Unicode-problematic characters in the cell values or in the column and row names when you do this on Windows. Otherwise a Unicode decode error occurs from time to time. I wasn't able to reproduce it reliably, so I have not filed a bug report about it.
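
A quick way to spot such labels before handing the data to R could look like the sketch below. It uses only the Python standard library, and the helper name non_ascii_labels is hypothetical. It only inspects labels. Cell values would need a similar pass, and it does not prove that non-ASCII characters are the actual cause of the error.

```python
def non_ascii_labels(labels):
    """Return the labels that contain at least one non-ASCII character."""
    # str() guards against non-string labels such as integer row indices.
    return [label for label in labels if not str(label).isascii()]


# Example with made-up column names; 'Größe' contains two non-ASCII characters.
columns = ['treat', 'age', 'Größe']
print(non_ascii_labels(columns))  # → ['Größe']
```

With a pandas data frame, the same helper can be applied to df.columns and df.index.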

buhtz