1

I have some data. ID and date and I'm trying to create a new field for semester.

df:

id  date
1   20160822
2   20170109
3   20170828
4   20170925
5   20180108
6   20180402
7   20160711
8   20150831
9   20160111
10  20160502
11  20160829
12  20170109
13  20170501

I also have a semester table:

start       end         season_year
20120801    20121222    Fall-2012
20121223    20130123    Winter-2013
20130124    20130523    Spring-2013
20130524    20130805    Summer-2013
20130806    20131228    Fall-2013
20131229    20140122    Winter-2014
20140123    20140522    Spring-2014
20140523    20140804    Summer-2014
20140805    20141227    Fall-2014
20141228    20150128    Winter-2015
20150129    20150528    Spring-2015
20150529    20150803    Summer-2015
20150804    20151226    Fall-2015
20151227    20160127    Winter-2016
20160128    20160526    Spring-2016
20160527    20160801    Summer-2016
20160802    20161224    Fall-2016
20161225    20170125    Winter-2017
20170126    20170525    Spring-2017
20170526    20170807    Summer-2017
20170808    20171230    Fall-2017
20171231    20180124    Winter-2018
20180125    20180524    Spring-2018
20180525    20180806    Summer-2018
20180807    20181222    Fall-2018
20181223    20190123    Winter-2019
20190124    20190523    Spring-2019
20190524    20180804    Summer-2019

I'd like to create a new field in df if df$date is between semester$start and semester$end, then place the respective value semester$season_year in df

I tried to see if the lubridate package could help but that seems to be more for calculations

I saw this question and it seems to be the closest to what i want, but, to make things more complicated, not all of our semesters are six months

tangerine7199
  • 443
  • 2
  • 8
  • 24
  • Take a look at this question. It's not quite the same, but it should get you 90% of the way there: https://stackoverflow.com/questions/37933800/test-if-date-occurs-in-multiple-date-ranges-with-r – divibisan Jun 26 '18 at 19:00

2 Answers2

2

Does this work?

library(lubridate)

semester$start <- ymd(semester$start)
semester$end <- ymd(semester$end)
df$date <- ymd(df$date)

LU <-  Map(`:`, semester$start, semester$end)
LU <- data.frame(value = unlist(LU),
                 index = rep(seq_along(LU), lapply(LU, length)))


df$semester <- semester$season_year[LU$index[match(df$date, LU$value)]]
1

A solution using non-equi update joins using data.table and lubridate package can be as:

library(data.table)

setDT(df)
setDT(semester)


df[,date:=as.IDate(as.character(date), format = "%Y%m%d")]
semester[,':='(start = as.IDate(as.character(start), format = "%Y%m%d"), 
                         end=as.IDate(as.character(end), format = "%Y%m%d"))]


df[semester, on=.(date >= start, date <= end), season_year := i.season_year]

df
#    id       date season_year
# 1:  1 2016-08-22   Fall-2016
# 2:  2 2017-01-09 Winter-2017
# 3:  3 2017-08-28   Fall-2017
# 4:  4 2017-09-25   Fall-2017
# 5:  5 2018-01-08 Winter-2018
# 6:  6 2018-04-02 Spring-2018
# 7:  7 2016-07-11 Summer-2016
# 8:  8 2015-08-31   Fall-2015
# 9:  9 2016-01-11 Winter-2016
# 10: 10 2016-05-02 Spring-2016
# 11: 11 2016-08-29   Fall-2016
# 12: 12 2017-01-09 Winter-2017
# 13: 13 2017-05-01 Spring-2017

Data:

df <- read.table(text="
id  date
1   20160822
2   20170109
3   20170828
4   20170925
5   20180108
6   20180402
7   20160711
8   20150831
9   20160111
10  20160502
11  20160829
12  20170109
13  20170501",
header = TRUE, stringsAsFactors = FALSE)


semester <- read.table(text="
start       end         season_year
20120801    20121222    Fall-2012
20121223    20130123    Winter-2013
20130124    20130523    Spring-2013
20130524    20130805    Summer-2013
20130806    20131228    Fall-2013
20131229    20140122    Winter-2014
20140123    20140522    Spring-2014
20140523    20140804    Summer-2014
20140805    20141227    Fall-2014
20141228    20150128    Winter-2015
20150129    20150528    Spring-2015
20150529    20150803    Summer-2015
20150804    20151226    Fall-2015
20151227    20160127    Winter-2016
20160128    20160526    Spring-2016
20160527    20160801    Summer-2016
20160802    20161224    Fall-2016
20161225    20170125    Winter-2017
20170126    20170525    Spring-2017
20170526    20170807    Summer-2017
20170808    20171230    Fall-2017
20171231    20180124    Winter-2018
20180125    20180524    Spring-2018
20180525    20180806    Summer-2018
20180807    20181222    Fall-2018
20181223    20190123    Winter-2019
20190124    20190523    Spring-2019
20190524    20180804    Summer-2019",
header = TRUE, stringsAsFactors = FALSE)
MKR
  • 19,739
  • 4
  • 23
  • 33
  • Thank you for your suggestion! I started it about twenty minutes ago, to add a new field to `df` and it hasn't ended yet. do you have any thoughts that can help speed it up? – tangerine7199 Jun 26 '18 at 19:48
  • @Walker Only option can be to not use `lubridate`. Instead you can use `as.IDate` from `data.table` itself. Hope that helps. – MKR Jun 26 '18 at 20:07
  • @Walker I have updated the answer. Thanks for pointing out. I have modified answer to use `IDate`. Please let us know performance benefit as it would help future users as well. – MKR Jun 26 '18 at 20:28
  • Use [`data.table::between()`](https://www.rdocumentation.org/packages/data.table/versions/1.11.4/topics/between) instead of `(date >= start, date <= end)` – smci Jun 26 '18 at 20:40