
I'm trying to speed up a process that uses pandas and R.

Suppose that I have the following dataframe:

import pandas as pd
from random import randint
# range works on both Python 2 and 3 (xrange exists only on Python 2)
df = pd.DataFrame({'mpg': [randint(1, 9) for _ in range(10)],
                   'wt': [randint(1, 9)*10 for _ in range(10)],
                   'cyl': [randint(1, 9)*100 for _ in range(10)]})
df
   mpg  wt  cyl
0    3  40  100
1    6  30  200
2    7  70  800
3    3  50  200
4    7  50  400
5    4  10  400
6    3  70  500
7    8  30  200
8    3  40  800
9    6  60  200

Then I use rpy2 to fit a model:

import rpy2.robjects.packages as rpackages
import rpy2.robjects as robjects
from rpy2.robjects import pandas2ri
pandas2ri.activate()

base = rpackages.importr('base')
stats = rpackages.importr('stats')

formula = 'mpg ~ wt + cyl'
fit_full = stats.lm(formula, data=df)

After this, I make some predictions:

rfits = stats.predict(fit_full, newdata=df)

This code runs without problems on a small dataframe, but my real dataframe has millions of rows, and I use other rpy2 models for which the prediction step takes a long time. That prediction step is the part I'm trying to speed up.

This is my first time using the multiprocessing library, and my attempt at this task was unsuccessful:

import multiprocessing as mp

pool = mp.Pool(processes=4)
rfits = pool.map(predict(fit_full, newdata=df))

I'm probably doing something wrong, since I don't see any speed improvement.

I think the main problem is that I'm trying to apply pool.map to an rpy2 function rather than a plain Python function. There is probably a workaround that doesn't use the multiprocessing library, but I can't see one.
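For reference, pool.map expects a callable and an iterable of inputs, whereas the call above evaluates the prediction once and passes its result as the only argument. A minimal sketch of the usual pattern follows: split the rows into chunks and map a worker function over them. The names COEFS, split_into_chunks, predict_chunk and parallel_predict are hypothetical, and the plain-Python linear predictor is only a stand-in for the rpy2 model. With rpy2, the assumption is that each worker process would need its own R session and its own copy of the fitted model, since R objects may not transfer between processes.

```python
import multiprocessing as mp

# Hypothetical stand-in for a fitted model: plain numbers that every worker
# process can see. This is NOT the rpy2 model from the question.
COEFS = {"intercept": 5.0, "wt": 0.04, "cyl": -0.004}


def split_into_chunks(rows, n_chunks):
    """Split a list of rows into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def predict_chunk(chunk):
    """Predict one chunk of (wt, cyl) rows; this is the callable for pool.map."""
    return [COEFS["intercept"] + COEFS["wt"] * wt + COEFS["cyl"] * cyl
            for wt, cyl in chunk]


def parallel_predict(rows, processes=4):
    """Predict all rows by mapping predict_chunk over chunks in a pool."""
    chunks = split_into_chunks(rows, processes)
    with mp.Pool(processes=processes) as pool:
        # Pass the function itself plus an iterable of inputs;
        # do not call the function here.
        per_chunk = pool.map(predict_chunk, chunks)
    # Flatten the per-chunk results back into one list, in the original order.
    return [y for part in per_chunk for y in part]


if __name__ == "__main__":
    rows = [(40, 100), (30, 200), (70, 800), (50, 200), (50, 400)]
    print(parallel_predict(rows, processes=2))
```

Whether this gives a speedup depends on how the cost of copying each chunk to a worker compares with the cost of predicting it, so a few large chunks per process are generally a safer starting point than many small ones.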

Any help would be greatly appreciated. Thanks in advance.

npires

1 Answer


Have you tried using StatsModels?

Fitting models using R-style formulas

Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs.

import statsmodels.formula.api as smf

formula = 'mpg ~ wt + cyl'
model = smf.ols(formula=formula, data=df)
params = model.fit().params

>>> params
Intercept    5.752803
wt           0.037770
cyl         -0.004112

>>> model.predict(params, exog=df)
array([ 1725.83759267,  2876.50148582,   575.25352613,  1150.6605447 ,
        1150.51281171,  3451.54178359,   575.53800931,   575.4146529 ,
        2876.58372342,  5177.46831077])
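As a sanity check, the formula 'mpg ~ wt + cyl' corresponds to Intercept + b_wt * wt + b_cyl * cyl. Here is a minimal sketch that applies the coefficients printed above to a few (wt, cyl) rows from the question's table. predict_mpg is a hypothetical helper, not a statsmodels function, and since these coefficients were fitted on a different random draw than that table, the numbers are only illustrative.

```python
# Coefficients copied from the params output above.
PARAMS = {"Intercept": 5.752803, "wt": 0.037770, "cyl": -0.004112}


def predict_mpg(wt, cyl, params=PARAMS):
    """Evaluate the linear predictor for 'mpg ~ wt + cyl' on one row."""
    return params["Intercept"] + params["wt"] * wt + params["cyl"] * cyl


# First three (wt, cyl) rows from the table in the question.
rows = [(40, 100), (30, 200), (70, 800)]
for wt, cyl in rows:
    print(wt, cyl, round(predict_mpg(wt, cyl), 3))
```

Predictions from this formula should land on the mpg scale (single digits for this data). The array above is in the thousands, which may indicate that the raw columns of df were multiplied by params positionally instead of going through the formula. Calling predict on the fitted results object, as in model.fit().predict(df), is the usual way to have the formula applied to new data.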
Alexander
  • Yes, I'm aware of statsmodels, but I actually need to use a GAM (generalized additive model), which I believe is not implemented in statsmodels yet. – npires Apr 20 '15 at 19:45
  • Correct. It is part of their 'Sandbox': http://statsmodels.sourceforge.net/devel/sandbox.html – Alexander Apr 20 '15 at 19:55
  • Wow! I didn't know about that! I'll give feedback after testing it. Thanks. – npires Apr 20 '15 at 20:14
  • GAM in statsmodels sandbox is not in a usable state. (There might be something useful before the end of the year.) However, patsy, the formula package used by statsmodels, can create design matrices with splines that can be used with any model. – Josef Apr 21 '15 at 00:33