As a learning exercise and because I'd like to do something similar with my own data, I'm trying to copy the answer to this example exactly but implement it in Python via rpy2.

This is turning out to be trickier than I thought because plyr uses a lot of convenient syntax (e.g. as.quoted variables, summarize, functions) that I haven't found easy to port to rpy2. Without even getting to the ggplot2 segment, this is what I've been able to manage so far, using **{} to allow use of the '.' arguments:

import rpy2.robjects as ro
from rpy2.robjects.packages import importr
stats = importr('stats')
plyr = importr('plyr')
bs = importr('base')
r = ro.r
df = ro.DataFrame

mms = df( {'delicious': stats.rnorm(100), 
           'type':bs.sample(bs.as_factor(ro.StrVector(['peanut','regular'])), 100, replace=True),
           'color':bs.sample(bs.as_factor(ro.StrVector(['r','g','y','b'])), 100, replace=True)} )

# first define a function, then use it in ddply call
myfunc  = r('''myfunc <- function(var) {paste('n =', length(var))} ''')
mms_cor = plyr.ddply(**{'.data':mms, 
                        '.variables':ro.StrVector(['type','color']), 
                        '.fun':myfunc})
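
For readers unfamiliar with the `**{}` trick used in the call above: a name such as `.data` is not a valid Python identifier, so it cannot be written as `.data=mms`, but CPython does accept arbitrary string keys when a dict is unpacked into a function that takes `**kwargs`. The sketch below shows only that mechanism in plain Python; `fake_ddply` is a hypothetical stand-in, not part of rpy2 or plyr.

```python
def fake_ddply(*args, **kwargs):
    """Hypothetical stand-in for a wrapped R function: returns what it was given."""
    return args, kwargs

# Keys containing '.' cannot be spelled as keyword arguments directly,
# but they pass through dict unpacking unchanged.
args, kwargs = fake_ddply(**{'.data': 'mms',
                             '.variables': ['type', 'color'],
                             '.fun': len})

print(sorted(kwargs))
```

The wrapped R function then sees the dotted names exactly as given, which is why the ddply call accepts '.data', '.variables' and '.fun' this way.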

This runs without error, but printing the resulting mms_cor gives the following, which suggests the function isn't working correctly in the context of the ddply call (the length of the mms data.frame is 3, which is what I think is being calculated because other inputs to myfunc return different values):

     type color    V1
1  peanut     b n = 3
2  peanut     g n = 3
3  peanut     r n = 3
4  peanut     y n = 3
5 regular     b n = 3
6 regular     g n = 3
7 regular     r n = 3
8 regular     y n = 3 
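
A constant value in every group is what one would see if the function counted columns rather than rows. The following is a rough plain-Python analogy, not rpy2 code: the "data frame" is assumed to be a dict of equal-length columns, and `split_by` is a hypothetical helper that mimics the split step of ddply.

```python
import random
from collections import defaultdict

random.seed(0)
n = 100
# Plain-Python stand-in for the mms data.frame (dict of columns).
mms_py = {
    'delicious': [random.gauss(0, 1) for _ in range(n)],
    'type': [random.choice(['peanut', 'regular']) for _ in range(n)],
    'color': [random.choice(['r', 'g', 'y', 'b']) for _ in range(n)],
}

def split_by(frame, keys):
    """Split a dict-of-columns into sub-frames, one per combination of key values."""
    groups = defaultdict(lambda: {col: [] for col in frame})
    nrows = len(next(iter(frame.values())))
    for i in range(nrows):
        key = tuple(frame[k][i] for k in keys)
        for col in frame:
            groups[key][col].append(frame[col][i])
    return dict(groups)

groups = split_by(mms_py, ['type', 'color'])

# Counting the container counts columns: the same number for every group.
n_columns = {key: len(sub) for key, sub in groups.items()}
# Counting one column counts rows: this varies per group.
n_rows = {key: len(sub['delicious']) for key, sub in groups.items()}

for key in sorted(groups):
    print(key, 'columns =', n_columns[key], 'rows =', n_rows[key])
```

Every sub-frame has three columns regardless of how many rows it holds, so a column count cannot distinguish the groups.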

Ideally I would get this to work with summarize, as done in the example answer, to have multiple calculations/label the output, but I couldn't get this to work either, and it really becomes awkward syntax-wise:

mms_cor = plyr.ddply(plyr.summarize, n=bs.paste('n =', bs.length('delicious')), 
                     **{'.data':mms,'.variables':ro.StrVector(['type','color'])})

This gives the same output as above with 'n = 1'. I know it's reflecting the length of the 1-item vector 'delicious', but can't figure out how to make this a variable instead of a string, or which variable it would be (which is why I moved toward the function above). Additionally, it would be useful to know how one might get the as.quoted variable syntax (e.g. ddply(.data=mms, .(type, color), ...)) to work with rpy2. I know plyr has several as_quoted methods, but I can't figure out how to use them because documentation and examples are tricky to find.

Any help is greatly appreciated. Thanks.

Edit:

lgautier's solution: fix myfunc by using nrow instead of length.

myfunc = r('''myfunc <- function(var) {paste('n =', nrow(var))} ''')

Solution for ggplot2, in case it is useful for others (note: I had to add x and y values to mms_cor as a workaround for using aes_string, because I can't get aes to work in the Python environment):

rggplot2 = importr('ggplot2')  # note ggplot2 import above doesn't take 'mapping' kwarg
p = rggplot2.ggplot(data=mms, mapping=rggplot2.aes_string(x='delicious')) + \
    rggplot2.geom_density() + \
    rggplot2.facet_grid('type ~ color') + \
    rggplot2.geom_text(data=mms_cor, mapping=rggplot2.aes_string(x='x', y='y', label='V1'), colour='black', inherit_aes=False)

p.plot()
williaster
  • In the first part, the function is working correctly and "n = 3" is what one would expect. `length(var)` is the length of the R data.frame, which is 3. `nrows(var)` is the number of rows. – lgautier Jan 08 '13 at 08:13
  • nrow (minus the 's') did the trick, thanks!. will add the ggplot2 solution as edit. – williaster Jan 08 '13 at 19:09
  • In general, I would avoid writing R code in Python as much as you can. I would create one big R function called e.g. `plot_stuff`, `source` that into your rpy session and call that function with the appropriate parameters. This also makes debugging the R code easier. – Paul Hiemstra Jan 08 '13 at 19:43
  • That's a fantastic, simple tip! I can't say how much easier that will make things. – williaster Jan 08 '13 at 20:35
  • @PaulHiemstra I'd recommend the exact opposite to someone writing an application in Python from existing bits in R (the exception being that one developer or the mixed group of Python and R developer are more proficient in R than in Python for a given task), and this precisely for the purpose of debugging. – lgautier Jan 08 '13 at 20:55
  • @williaster the missing "mapping" is a bug (either long-standing but unnoticed, or a new feature introduced in recent changes to R's ggplot2) – lgautier Jan 08 '13 at 21:07
  • Could you elaborate @Igautier? I would write the R code in R files, and the Python code in Python files and keep the amount of R code in the python file (rpy code) to a minimum. Keeping the interface at a minimum is important in detaching the R and Python code. – Paul Hiemstra Jan 08 '13 at 21:08
  • @PaulHiemstra The divide between the "R code" and the "Python code" come from existing R code (that's clearly "R code") and the ability of the person/team to come up with the blocks needed to build the application in R or Python, this being defined for each block. If someone is more comfortable with Python, there is little advantage in starting to write potentially not well-understood R code in a string and evaluate it. Keeping the logic in Python allows debugging to occur just like any other Python program. – lgautier Jan 09 '13 at 15:02
  • @PaulHiemstra (continued) In other words, what is/should be R code and what is/should be Python code is not universal and depends on what is readily available (either code or skills). Since rpy2 is about developing Python applications, while being able to call (or parse and evaluate) R code, I argue that it makes more sense to keep more logic on the Python side. If performances are a concern, interfacing (that is crossing between Python and R) does have a cost but this approach can still be made faster than pure R (because R is slower than Python - still benchmark in the rpy2 documentation). – lgautier Jan 09 '13 at 15:07
  • @PaulHiemstra (final) If you are equally comfortable with both languages (R and Python), the divide is comparable to making code modular, like in what should be broken down into functions, with the added parameter that different ideas can be expressed in one or the other language. That's more an art than a science, I think: there is no one right way but there are nice ways... and less-nice ways. – lgautier Jan 09 '13 at 15:16

1 Answer

Since you are using a callback, I can't resist showing one of the unexpected things rpy2 can do (note: the code is untested, so there might be typos):

def myfunc(var):
    # var is a data.frame, the length of
    # the first vector is the number of rows
    if len(var) == 0:
        nr = 0
    else:
        nr = len(var[0])
    # any string format feature in Python could
    # be used here
    return 'n = %i' % nr 

# create R function from the Python function
from rpy2.rinterface import rternalize
myfunc_r = rternalize(myfunc)

mms_cor = plyr.ddply(**{'.data':mms, 
                        '.variables':ro.StrVector(['type','color']), 
                        '.fun':myfunc_r})
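
Because the callback is ordinary Python, its logic can be checked without R before it is wrapped. The sketch below repeats the function and calls it on a list of columns, which is only an assumed stand-in for the object R would pass in (`fake_frame` is hypothetical and is not an rpy2 type).

```python
def myfunc(var):
    # var is treated as a sequence of columns; the length of
    # the first column is the number of rows
    if len(var) == 0:
        nr = 0
    else:
        nr = len(var[0])
    return 'n = %i' % nr

# Stand-in for one group of a data.frame: three columns, four rows each.
fake_frame = [
    [0.1, -0.3, 1.2, 0.7],                         # delicious
    ['peanut', 'peanut', 'peanut', 'peanut'],      # type
    ['r', 'r', 'r', 'r'],                          # color
]

label = myfunc(fake_frame)
empty_label = myfunc([])
print(label, '|', empty_label)
```

This exercises only the Python side; how the value is converted on its way back to R is not covered here.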
lgautier