21

What is the recommended way (if any) for doing linear regression using a pandas dataframe? I can do it, but my method seems very elaborate. Am I making things unnecessarily complicated?

The R code, for comparison:

x <- c(1,2,3,4,5)
y <- c(2,1,3,5,4)
M <- lm(y~x)
summary(M)$coefficients
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880

Now, my Python (2.7.10), rpy2 (2.6.0), and pandas (0.16.1) version:

import pandas
import pandas.rpy.common as common
from rpy2 import robjects
from rpy2.robjects.packages import importr

base = importr('base')
stats = importr('stats')

dataframe = pandas.DataFrame({'x': [1,2,3,4,5], 
                              'y': [2,1,3,5,4]})

robjects.globalenv['dataframe']\
   = common.convert_to_r_dataframe(dataframe) 

M = stats.lm('y~x', data=base.as_symbol('dataframe'))

print(base.summary(M).rx2('coefficients'))

            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880

By the way, I do get a FutureWarning on the import of pandas.rpy.common. However, when I tried pandas2ri.py2ri(dataframe) to convert the dataframe from pandas to R (as mentioned here), I got

NotImplementedError: Conversion 'py2ri' not defined for objects of type '<class 'pandas.core.series.Series'>'
mjandrews

3 Answers

28

After calling pandas2ri.activate() some conversions from Pandas objects to R objects happen automatically. For example, you can use

M = R.lm('y~x', data=df)

instead of

robjects.globalenv['dataframe'] = dataframe
M = stats.lm('y~x', data=base.as_symbol('dataframe'))

import pandas as pd
from rpy2 import robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()
R = ro.r

df = pd.DataFrame({'x': [1,2,3,4,5], 
                   'y': [2,1,3,5,4]})

M = R.lm('y~x', data=df)
print(R.summary(M).rx2('coefficients'))

yields

            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880
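
As a cross-check that does not need R or rpy2, the estimates, standard errors and t values in this table can be reproduced with the textbook formulas for simple linear regression. The following is only a sketch using the Python 3 standard library; the function name `simple_ols` is made up for this illustration and is not part of pandas or rpy2.

```python
import math

def simple_ols(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns {term: (estimate, std_error, t_value)}.
    Hypothetical helper for illustration, not an rpy2 or pandas API.
    """
    n = len(x)
    mean_x = sum(x) / float(n)
    mean_y = sum(y) / float(n)
    # Sums of squares and cross-products around the means
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters)
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    se_intercept = math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / sxx))
    return {
        "(Intercept)": (intercept, se_intercept, intercept / se_intercept),
        "x": (slope, se_slope, slope / se_slope),
    }

table = simple_ols([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
for term in ("(Intercept)", "x"):
    est, se, t = table[term]
    print("%-12s %8.4f %10.7f %9.6f" % (term, est, se, t))
```

This only covers the one-predictor case; for anything more general, going through `lm` as shown above is the simpler route.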
unutbu
14

The R and Python versions are not strictly identical: in Python/rpy2 you build a data frame, whereas in R you use plain vectors (without a data frame).

Otherwise, the conversion that ships with rpy2 appears to work here:

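# robjects, base, stats and dataframe are defined as in the question
from rpy2.robjects import pandas2ri
pandas2ri.activate()  # enable automatic pandas -> R conversion
robjects.globalenv['dataframe'] = dataframe
M = stats.lm('y~x', data=base.as_symbol('dataframe'))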

The result:

>>> print(base.summary(M).rx2('coefficients'))
            Estimate Std. Error  t value  Pr(>|t|)
(Intercept)      0.6  1.1489125 0.522233 0.6376181
x                0.8  0.3464102 2.309401 0.1040880
lgautier
  • Nice. Thank you. I knew my initial attempt was probably over-complicating things. – mjandrews Jul 01 '15 at 13:37
  • 1
    @l unutbu's answer looks really intuitive, as there is no need to assign the DF in the R namespace or use as_symbol. Is passing a pandas DF directly to the R function, as in unutbu's example, supported syntax, or will it be deprecated? My perusal of the documentation hasn't yielded an answer. – KGS Aug 23 '15 at 14:34
  • @KGS: my answer focused on refuting the claim that the conversion of data frames is not working. To do so, I kept the code in the question as unchanged as possible. I don't see @unutbu's answer becoming invalid any time soon: R's `stats::lm` has always accepted a `data` parameter, and I don't think that would change easily. – lgautier Aug 24 '15 at 03:22
3

I can add to unutbu's answer by outlining how to retrieve particular elements of the coefficients table including, crucially, the p-values.

def r_matrix_to_data_frame(r_matrix):
    """Convert an R matrix into a Pandas DataFrame"""
    import pandas as pd
    from rpy2.robjects import pandas2ri
    array = pandas2ri.ri2py(r_matrix)
    return pd.DataFrame(array,
                        index=r_matrix.names[0],
                        columns=r_matrix.names[1])

# Let's start from unutbu's line retrieving the coefficients:
coeffs = R.summary(M).rx2('coefficients')
df = r_matrix_to_data_frame(coeffs)

This leaves us with a DataFrame which we can access in the normal way:

In [179]: df['Pr(>|t|)']
Out[179]:
(Intercept)    0.637618
x              0.104088
Name: Pr(>|t|), dtype: float64

In [181]: df.loc['x', 'Pr(>|t|)']
Out[181]: 0.10408803866182779
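
For context, the values in the Pr(>|t|) column are two-sided tail probabilities of a Student t distribution with n - 2 = 3 residual degrees of freedom. The sketch below recomputes them from the t values printed in the table, using only the Python standard library and Simpson's rule for the integral. The names `t_pdf` and `two_sided_p` are made up for this illustration; they are not rpy2 or pandas functions.

```python
import math

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + t * t / df) ** (-(df + 1) / 2.0)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the t density integrated over [-|t|, |t|].

    Uses composite Simpson's rule on [0, |t|]; steps must be even.
    Hypothetical helper for illustration only.
    """
    a = abs(t)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        # Simpson weights: 4 for odd interior points, 2 for even ones
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3.0       # integral over [0, |t|]
    return 1.0 - 2.0 * half      # the density is symmetric around 0

# t values copied from the coefficients table; 5 observations, 2 parameters
for term, t in [("(Intercept)", 0.522233), ("x", 2.309401)]:
    print("%-12s %.6f" % (term, two_sided_p(t, df=3)))
```

The printed values should agree with the Pr(>|t|) column up to rounding of the input t values.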
LondonRob