Why do I lose names of rows and columns when using rpy2?

Question

I'm using R's MatchIt package from Python via the rpy2 package, and I transfer the results from R to Python. During this transfer I lose the row and column names, but only in one specific situation, and I would like to understand what the difference is.

R code

First, here is the original R script. Keep in mind that this script is not executed by Python; rpy2 (see the next section) takes a different approach to calling R. The two variants in this code are relevant in the next section.

library("MatchIt")
data("lalonde")

# simplify
lalonde = lalonde[,c("treat", "age", "race", "married")]

# matching
match_out <- matchit(
    treat ~ age + race + married,
    data = lalonde,
    method = "nearest",
    distance = "glm"
)

## Variant A
balance_A <- result <- as.data.frame(summary(match_out)$sum.matched)

## Variant B
sum_matched <- summary(match_out)$sum.matched
balance_B <- as.data.frame(sum_matched)

The objects balance_A and balance_B are equal and look like this.

> balance_A
           Means Treated Means Control Std. Mean Diff. Var. Ratio  eCDF Mean  eCDF Max Std. Pair Dist.
distance      0.56610932     0.3620326       0.9661981  0.6473161 0.13317246 0.4000000       0.9687231
age          25.81621622    28.1027027      -0.3195640  0.4220499 0.08527027 0.1621622       1.1687127
raceblack     0.84324324     0.4702703       1.0258593         NA 0.37297297 0.3729730       1.0258593
racehispan    0.05945946     0.3135135      -1.0743033         NA 0.25405405 0.2540541       1.3028784
racewhite     0.09729730     0.2162162      -0.4012621         NA 0.11891892 0.1189189       0.4742189
married       0.18918919     0.2918919      -0.2622249         NA 0.10270270 0.1027027       0.6762642

Python code

Here is the same approach in Python using the rpy2 package.

#!/usr/bin/env python3
import rpy2
from rpy2.robjects.packages import importr, data
import rpy2.robjects as robjects
import rpy2.robjects.pandas2ri as pandas2ri
import pydataset

if __name__ == '__main__':
    # For converting objects from/into Pandas <-> R
    # Credits: https://stackoverflow.com/a/20808449/4865723
    pandas2ri.activate()

    # import
    matchit_pkg = robjects.packages.importr('MatchIt')

    # data
    df = robjects.r('''
        library(MatchIt)
        data(lalonde)
        return(lalonde)
    ''')
    df = df.loc[:, ['treat', 'age', 'race', 'married']]

    # get match object
    match_out = robjects.r['matchit'](
        formula=robjects.Formula('treat ~ age + race + married'),
        data=df,
        method='nearest',
        distance='glm')

    ## Variant A
    print('\n-- Variant A --')

    get_balance_dataframe = robjects.r('''f <- function(match_out) {
        result <- as.data.frame(summary(match_out)$sum.matched)
        return(result)
    }
    ''')
    balance_A = get_balance_dataframe(match_out)
    balance_A = robjects.conversion.rpy2py(balance_A)
    print(balance_A)  # <--- OK

    ## Variant B
    print('\n-- Variant B --')

    get_sum_matched = robjects.r('''f <- function(match_out) {
        result <- summary(match_out)$sum.matched
        return(result)
    }
    ''')
    sum_matched = get_sum_matched(match_out)
    print(sum_matched)  # <--- Looks like a matrix

    matrix_to_dataframe = robjects.r('''f <- function(a_matrix) {
        result <- as.data.frame(a_matrix)
        return(result)
    }''')
    balance_B = matrix_to_dataframe(sum_matched)
    balance_B = robjects.conversion.rpy2py(balance_B)
    print(balance_B)  # <--- Names of rows and columns lost

Output

Variant A is OK

This seems OK.

-- Variant A --
          Means Treated  Means Control  Std. Mean Diff.  Var. Ratio  eCDF Mean  eCDF Max  Std. Pair Dist.
distance       0.560643       0.378393         0.898469    0.689696   0.132819  0.400000         0.902191
age           25.816216      28.016216        -0.307476    0.418415   0.086622  0.162162         1.316785
race           1.254054       1.729730        -0.765436    0.643151   0.158559  0.372973         0.765436
married        0.189189       0.308108        -0.303629         NaN   0.118919  0.118919         0.607258

Variant B has a problem

Here the names of columns and rows are lost.

          V1         V2        V3        V4        V5        V6        V7
1   0.560643   0.378393  0.898469  0.689696  0.132819  0.400000  0.902191
2  25.816216  28.016216 -0.307476  0.418415  0.086622  0.162162  1.316785
3   1.254054   1.729730 -0.765436  0.643151  0.158559  0.372973  0.765436
4   0.189189   0.308108 -0.303629       NaN  0.118919  0.118919  0.607258

`sum_matched` is a numpy array, which is why the row and column labels are lost. In other words, the `result` returned by the `get_sum_matched` R function is not a data frame (it appears to be a `FloatSexpVector`), so the labels were presumably lost in R. — kesh, Aug 05 '22 at 01:51
I know that is a numpy array. But why does it work in Variant A? Isn't it a numpy array there, too? — buhtz, Aug 05 '22 at 06:16

kesh · Answer 1 · 2022-08-05T16:30:15.753

This happens because rpy2's default numpy converter ignores row and column names. Specifically, this line is the culprit (as commented above):

sum_matched = get_sum_matched(match_out)

The R function called by get_sum_matched() returns an array with row and column names, but rpy2's default autoconverter ignores these names. Hence, the names are lost after this line.
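This also explains why Variant A keeps its labels: there the data frame is built inside R before anything is converted, while in Variant B the matrix is converted first and sent back to R without labels. The following is a toy model in plain Python, not rpy2 code. `RMatrix`, `to_python`, `to_r` and `as_data_frame` are hypothetical stand-ins that only mimic the behaviour seen in the question (labels dropped on conversion, and R's `as.data.frame` inventing `V1`, `V2`, ... and `1`, `2`, ... for an unlabelled matrix).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RMatrix:
    """Hypothetical stand-in for an R matrix with optional dimnames."""
    values: List[List[float]]
    rownames: Optional[List[str]] = None
    colnames: Optional[List[str]] = None


def to_python(m: RMatrix) -> List[List[float]]:
    """Mimics the default conversion: only the numbers survive."""
    return [row[:] for row in m.values]


def to_r(values: List[List[float]]) -> RMatrix:
    """Mimics sending a bare array back to R: there are no labels to send."""
    return RMatrix(values)


def as_data_frame(m: RMatrix) -> dict:
    """Mimics R's as.data.frame(): invents labels when the matrix has none."""
    n_rows = len(m.values)
    n_cols = len(m.values[0]) if m.values else 0
    rows = m.rownames or [str(i + 1) for i in range(n_rows)]
    cols = m.colnames or [f"V{j + 1}" for j in range(n_cols)]
    return {"index": rows, "columns": cols, "data": m.values}


sum_matched = RMatrix(
    [[0.56, 0.38], [25.8, 28.0]],
    rownames=["distance", "age"],
    colnames=["Means Treated", "Means Control"],
)

# Variant A: the data frame is built inside R, before anything is converted.
variant_a = as_data_frame(sum_matched)

# Variant B: the matrix crosses into Python first and loses its labels.
variant_b = as_data_frame(to_r(to_python(sum_matched)))

print(variant_a["columns"])
print(variant_b["columns"])
```

In this model the only difference between the two variants is the round trip through `to_python` and `to_r`, which is where the labels disappear.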

To retain the names, you must write your own converter and override the default one. Here is what I tried in the past:


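import numpy as np
import pandas as pd
import rpy2.robjects as ro
from rpy2 import rinterface
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

# start from the default rules plus the pandas rules
local_rules = ro.default_converter + pandas2ri.converter

@local_rules.rpy2py.register(rinterface.FloatSexpVector)
def rpy2py_floatvector(obj):
    x = np.array(obj)
    try:
        # if names is assigned, convert to pandas series
        return pd.Series(x, obj.names)
    except Exception:
        # if dimnames assigned, it's a named matrix, convert to pandas dataframe
        try:
            rownames, colnames = obj.do_slot("dimnames")
            return pd.DataFrame(x, index=rownames, columns=colnames)
        except Exception:
            # plain vector/matrix
            return x

# apply the custom rules only inside this block
with localconverter(local_rules):
    balance_B = get_sum_matched(match_out)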

The decorator should override the default conversion routine for an R numeric array. The custom converter returns a pandas Series if names are defined, or a pandas DataFrame if row and column names (dimnames) are defined. So matrix_to_dataframe is no longer needed.
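The way a later registration replaces an earlier rule for the same type can be pictured with `functools.singledispatch` from the standard library. This is only an analogy for the override described above, not rpy2 itself. `FakeFloatVector`, `to_py`, `_drop_names` and `_keep_names` are hypothetical names made up for the sketch.

```python
from functools import singledispatch


class FakeFloatVector:
    """Hypothetical stand-in for an R numeric vector with optional names."""

    def __init__(self, values, names=None):
        self.values = list(values)
        self.names = names


@singledispatch
def to_py(obj):
    """Fallback rule: hand the object back unchanged."""
    return obj


@to_py.register(FakeFloatVector)
def _drop_names(obj):
    # Plays the role of the default rule: keep the numbers only.
    return obj.values


vec = FakeFloatVector([0.56, 25.8], names=["distance", "age"])
before = to_py(vec)


@to_py.register(FakeFloatVector)
def _keep_names(obj):
    # Registering again for the same type replaces the earlier rule.
    if obj.names is None:
        return obj.values
    return dict(zip(obj.names, obj.values))


after = to_py(vec)
print(before)
print(after)
```

The same call returns bare numbers before the second registration and labelled values after it, which is the effect the custom converter is meant to have.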

One thing to note: I don't use pandas2ri.activate(), so I don't know how it interacts with a local converter.