I'm trying to find the best way (best as in performance) to add a new column called "Season", holding one of the four seasons of the year, to a data frame of this form:

      MON DAY YEAR
    1   1   1 2010
    2   1   1 2010
    3   1   1 2010
    4   1   1 2010
    5   1   1 2010
    6   1   1 2010

One straightforward way to do this is to write a loop conditioned on the MON and DAY columns and assign the values one by one, but I think there is a better way. I've seen suggestions in other posts for ifelse, := or apply, but in most of them the problem is just binary, or the value can be computed by a single function f of the parameters.

In my situation I believe a vector containing the four season labels, combined somehow with the conditions, would suffice, but I don't see how to put everything together. My situation resembles a switch/case more than anything else.

  • Please observe the title of the question. The season example is just to contextualize. The question is more general. Just replace MON,DAY,YEAR by other columns and the solution provided on the following link won't answer my question title anymore.. – Oeufcoque Penteano Feb 14 '15 at 03:16
  • Several alternative here: http://stackoverflow.com/questions/24946955/format-date-time-as-seasons-in-r – IRTFM Feb 14 '15 at 03:22
  • @BondedDust Please read the comment above yours. – Oeufcoque Penteano Feb 14 '15 at 03:23
  • @OeufcoquePenteano you can nest `ifelse` to have more than two outcomes. See, for instance, http://stackoverflow.com/questions/18012222/nested-ifelse-statement-in-r – josliber Feb 14 '15 at 03:25
  • @josilber Is this superior to using some variation of apply with an ifelse function? – Oeufcoque Penteano Feb 14 '15 at 03:27
  • The first answer (mine) in the link _does_ provide the option for multiple outcomes. That was _why_ I offered it here. You need to clarify what your needs are by posting an example and a specific description of the desired result. – IRTFM Feb 14 '15 at 03:27

2 Answers

Using modulo arithmetic, together with the fact that arithmetic operators coerce logical values to 0/1, will be far more efficient if the number of rows is large:

d$SEASON <- with(d,  c( "Winter","Spring", "Summer", "Autumn")[
                               1+(( (DAY>=21) + MON-1) %/% 3)%%4 ] )

The first added "1" shifts the range of the %%4 operation on all the results inside the parentheses from 0:3 to 1:4. The second, subtracted "1" shifts the (inner) 1:12 range back to 0:11, and the (DAY >= 21) term advances the boundary months forward by one.
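As a quick sanity check, the same arithmetic can be broken into steps and run on a few boundary dates. This is only a sketch: the `check` data frame and the intermediate names `shifted` and `bucket` are made up for illustration, and the cut-off on the 21st is the same assumption the formula above makes.

```r
# Boundary dates chosen to exercise each season change (assumed cut-off: the 21st).
check <- data.frame(
  MON = c(1, 3, 3, 6, 6, 9, 9, 12, 12),
  DAY = c(1, 20, 21, 20, 21, 20, 21, 20, 21)
)
labels <- c("Winter", "Spring", "Summer", "Autumn")

# Shift months from 1:12 to 0:11 and add 1 on or after the 21st (TRUE counts as 1).
shifted <- (check$DAY >= 21) + check$MON - 1
# Group into blocks of three months, wrapping late December back to block 0.
bucket <- (shifted %/% 3) %% 4
# R vectors are 1-indexed, hence the final "+ 1".
check$SEASON <- labels[1 + bucket]

print(check)
```

March 20 should still come out as Winter and March 21 as Spring; December 21 wraps around to Winter because 12 %/% 3 is 4 and 4 %% 4 is 0.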

IRTFM

I'll start by giving a simple answer, then I'll delve into the details. A quick way to do this would be to check the values of MON and DAY and output the correct season. This is trivial:

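# Season code for a single month m and day d:
# 0 = Spring, 1 = Summer, 2 = Autumn, 3 = Winter
f=function(m,d){
  if(m==12 && d>=21) i=3
  else if(m>9 || (m==9 && d>=21)) i=2
  else if(m>6 || (m==6 && d>=21)) i=1
  else if(m>3 || (m==3 && d>=21)) i=0
  else i=3
  i  # return the code explicitly instead of relying on the invisible value of the assignment
}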

This f function, given a month and a day, returns an integer corresponding to the season (it doesn't matter much whether it's an integer or a string; an integer only saves a bit of memory, but that's a technicality). Now you want to apply it to your data.frame. There is no need for a loop here; we'll use mapply. d will be our simulated data.frame. We'll convert the output to a factor to get nice season names.

d=data.frame(MON=rep(1:12,each=30),DAY=rep(1:30,12),YEAR=2012)
d$SEA=factor(
  mapply(f,d$MON,d$DAY),
  levels=0:3,
  labels=c("Spring","Summer","Autumn","Winter")
)

There you have it!

I realize seasons don't always change on the 21st. If you need fine tuning, you could define an array as a global variable to store the accurate days. Given a season and a year, you could look up the corresponding day and replace the "21"s in the f function with the right lookups (you would obviously add a third argument for the year).
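A minimal sketch of that lookup idea, written in vectorized form rather than by editing f. Everything named here is hypothetical: `start_day`, `season_for`, and especially the day values, which are placeholders to be replaced with the dates you actually need.

```r
# Placeholder start days (NOT authoritative): rows are years, columns are the
# months in which a season begins.
start_day <- matrix(
  c(20, 21, 23, 21,
    20, 20, 22, 21),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("2010", "2012"), c("3", "6", "9", "12"))
)

season_for <- function(mon, day, year) {
  labels <- c("Winter", "Spring", "Summer", "Autumn")
  # In months where no season starts the cut-off never changes the result,
  # so any value works there; 21 is used as a filler.
  cutoff <- rep(21, length(mon))
  boundary <- mon %in% c(3, 6, 9, 12)
  # Index the matrix by (year, month) name pairs, one pair per boundary row.
  cutoff[boundary] <- start_day[cbind(as.character(year[boundary]),
                                      as.character(mon[boundary]))]
  labels[1 + (((day >= cutoff) + mon - 1) %/% 3) %% 4]
}

season_for(mon  = c(6, 6, 9, 9),
           day  = c(20, 20, 22, 22),
           year = c(2010, 2012, 2010, 2012))
```

With the placeholder table, June 20 falls in Spring for 2010 but in Summer for 2012. A year missing from the table raises a "subscript out of bounds" error, which is arguably better than silently falling back to the 21st.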

About the things you mentioned in your question:

  • ifelse is the "functional" way to write a conditional test. On a single value it's only slightly better than a conditional statement, but it is vectorized, meaning that if the argument is a vector, it is applied to each of its elements. I'm not familiar with it, but it's the way to go for an optimized solution.
  • mapply is derived from sapply, of the "apply family", and calls a function with several arguments over vectors, element by element (see ?mapply).
  • I don't think := is a standard operator in R, which brings me to my next point:
  • data.table! It's a package that provides a new structure extending data.frame for fast computing and concise syntax (among other things). := is an operator in that package that defines new columns. In our case you could write d[,SEA:=mapply(f,MON,DAY)] if d is a data.table.
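For comparison, here is a sketch of what a vectorized version of f might look like with nested ifelse calls. The name `season_code` is made up for this example; it uses the element-wise operators `&` and `|` instead of `&&` and `||`, and it keeps the same 0 to 3 coding and 21st cut-off as f.

```r
# Vectorized counterpart of f: m and d are whole columns, not single values.
# Codes: 0 = Spring, 1 = Summer, 2 = Autumn, 3 = Winter.
season_code <- function(m, d) {
  ifelse(m == 12 & d >= 21, 3,
  ifelse(m > 9 | (m == 9 & d >= 21), 2,
  ifelse(m > 6 | (m == 6 & d >= 21), 1,
  ifelse(m > 3 | (m == 3 & d >= 21), 0, 3))))
}

d <- data.frame(MON = rep(1:12, each = 30), DAY = rep(1:30, 12), YEAR = 2012)
d$SEA <- factor(
  season_code(d$MON, d$DAY),
  levels = 0:3,
  labels = c("Spring", "Summer", "Autumn", "Winter")
)

table(d$SEA)
```

No mapply is needed because each ifelse already works on the whole column at once. If d were a data.table, the same function should slot into the := form, along the lines of d[, SEA := season_code(MON, DAY)].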

If you really care about performance, I can't recommend data.table enough, as it is a major improvement if you have a lot of data. I don't know whether it would really affect computation time with the solution I proposed, though.
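One way to find out on your own machine is to time the row-by-row call against a vectorized expression on a larger simulated data set. This is a rough sketch, not a rigorous benchmark: `big`, `n`, and the timing variable names are invented here, f is the scalar function from above, and the arithmetic expression is the modulo approach re-based to the same 0 to 3 codes.

```r
# Scalar function from above (0 = Spring, 1 = Summer, 2 = Autumn, 3 = Winter).
f <- function(m, d) {
  if (m == 12 && d >= 21) i <- 3
  else if (m > 9 || (m == 9 && d >= 21)) i <- 2
  else if (m > 6 || (m == 6 && d >= 21)) i <- 1
  else if (m > 3 || (m == 3 && d >= 21)) i <- 0
  else i <- 3
  i
}

# Simulated data; the size is arbitrary and can be increased.
n <- 1e5
set.seed(1)
big <- data.frame(
  MON = sample(1:12, n, replace = TRUE),
  DAY = sample(1:30, n, replace = TRUE)
)

# Row-by-row: one R function call per row.
t_mapply <- system.time(by_row <- mapply(f, big$MON, big$DAY))

# Vectorized: whole-column arithmetic; "+ 3" re-bases the blocks so Spring is 0.
t_vector <- system.time(
  vectorized <- ((((big$DAY >= 21) + big$MON - 1) %/% 3) + 3) %% 4
)

print(rbind(mapply = t_mapply, vectorized = t_vector))
stopifnot(all(by_row == vectorized))  # both approaches must agree
```

The final check confirms that the two approaches produce the same codes, so any timing difference comes from how the work is done, not from what is computed.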

YacineH
  • If you try this on a dataset with more than a couple of hundred lines you will see that the performance differences between `ifelse` and `if(){}else{}` are substantial. – IRTFM Feb 14 '15 at 04:31
  • I didn't realize that ifelse was vectorized. I'm not familiar with its uses. I'll edit accordingly. – YacineH Feb 15 '15 at 21:21