Forgive me if I'm asking too basic a question here (I'm not too experienced in R), but I'm currently trying to plot some natural cubic splines in R and I'm running up against a wall.

I have a data set which has ~3500 rows and about 30 columns. This is a data set of single-season baseball statistics for about 270 different baseball players over their entire careers. So basically, I have about 270 time series (one for each player).

I'm interested in player performance as measured by this thing called wOBA over time, so I want to fit a natural cubic spline to each and then overlay all the splines on one graph. And yes, it must be a natural cubic spline. And as far as I know, this is the only way to do it in ggplot.

My current code for doing this is:

  library(ggplot2)
  library(splines)  # provides ns() for natural cubic splines

  #initialize plot
  plot <- ggplot(data, aes(x=age, y=wOBA, color=playerID, group=playerID)) + theme(legend.position="none")

  #loop through players to add splines (data is a data.table, hence the list(...) column selection)
  for (i in unique(data$playerID)) {
    plot <- plot + stat_smooth(method = lm, formula = y~ns(x,3), data=data[which(data$playerID==i),list(playerID,age,wOBA)], se=FALSE)
  }

I have checked that I can run the code inside the loop manually for a couple of different players, and the plot turns out exactly as I want it. But when I run the loop itself, it takes forever. I checked memory usage while the loop was running, and R definitely ran out of memory (I am on a 4 GB machine).

I'm a little confused as to why this is. I would not have expected fitting just 270 splines to use up the more than 2 GB of memory that was free when I started.

I'm somewhat new to R, so I'm sure I'm missing something. Can anyone give any pointers? Sorry if this is a completely bone-headed question!

gogurt
  • Precalculate your statistics (first column value, second column player ID and third column time) and plot that as one figure. – Roman Luštrik Oct 06 '13 at 22:03
  • you shouldn't need to do this in a loop at all. Since you have defined `group=playerID` you should just be able to add `stat_smooth(method = lm, formula = y~ns(x,3), se=FALSE)` to `plot` a *single time* and it should Just Work. And it should be much faster. – Ben Bolker Oct 07 '13 at 00:07
  • @BenBolker: wow, you're completely right. This just shows how little I truly understand about how ggplot works. Thanks so much, and it looks like I'll need to spend some more time with the documentation. – gogurt Oct 07 '13 at 01:03
  • While it's true that using `stat_smooth` inside the `plot` object works, you may be better off in the long run doing all your spline work on the original matrix/dataframe object and then plotting. The advantage is that you can modify the data (e.g. add one new player) without having to recalculate the entire dataset. – Carl Witthoft Oct 07 '13 at 11:31
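
For anyone landing here later, this is a rough sketch of the two suggestions in the comments above, not a tested answer for the original data. The data frame is a small made-up stand-in (5 hypothetical players, ages 22 to 35, simulated wOBA), and the names `p`, `p2`, `fits`, and `grid` are only for illustration. The first plot uses a single `stat_smooth` layer and lets `group=playerID` do the per-player fitting. The second precomputes the fitted values with `lm` and `ns` and draws them with `geom_line`.

```r
library(ggplot2)
library(splines)  # ns() builds the natural cubic spline basis

# Made-up stand-in for the real data set: 5 players, one row per age
set.seed(1)
data <- expand.grid(playerID = paste0("p", 1:5), age = 22:35,
                    stringsAsFactors = FALSE)
data$wOBA <- 0.32 + 0.002 * (data$age - 22) -
  0.0003 * (data$age - 22)^2 + rnorm(nrow(data), sd = 0.02)

# Suggestion 1: one stat_smooth layer, no loop.
# Because of group = playerID, a separate spline is fitted per player.
p <- ggplot(data, aes(x = age, y = wOBA, color = playerID, group = playerID)) +
  stat_smooth(method = "lm", formula = y ~ ns(x, 3), se = FALSE) +
  theme(legend.position = "none")

# Suggestion 2: fit each player's spline up front, then plot the fitted curves.
fits <- do.call(rbind, lapply(split(data, data$playerID), function(d) {
  # Evaluate the fitted spline on a fine grid over this player's age range
  grid <- data.frame(age = seq(min(d$age), max(d$age), length.out = 50))
  grid$wOBA <- predict(lm(wOBA ~ ns(age, 3), data = d), newdata = grid)
  grid$playerID <- d$playerID[1]
  grid
}))

p2 <- ggplot(fits, aes(x = age, y = wOBA, color = playerID, group = playerID)) +
  geom_line() +
  theme(legend.position = "none")
```

Printing `p` or `p2` draws the overlay. With the precomputed version, adding a new player should only require fitting that one player and appending the rows to `fits`.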
