I am trying to split my dataset longitudinally by year into training (entries from year <= 2004) and testing (year > 2004) sets. I know how to split with caret using random sampling, but I am not sure how to modify that to split on a particular year.

The time column is formatted as year-month-day, followed by the time the sample was collected.

I have looked into the createTimeSlices function in the caret package but do not understand how to specify a particular year to slice by. It appears that createTimeSlices is meant for Cross Validation?

Any ideas of a package I can use to solve this?

KanyeNE
  • Using `base`: `split(your_data, your_data$year <= 2004)`. This will give you a 2-element list: the element named `TRUE` holds the rows from 2004 and earlier, and the element named `FALSE` holds the rows after 2004. – Gregor Thomas Jul 10 '19 at 17:24
  • The "year" column I have is in year-month-day followed by time the sample was taken. When I run the split you mentioned it gives me a list of 0. – KanyeNE Jul 10 '19 at 17:32
  • Here's a question on extracting the year from the date: https://stackoverflow.com/q/36568070/903061 – Gregor Thomas Jul 10 '19 at 17:35
  • If you are using a date time class like `POSIXct`, then `split(your_data, as.numeric(format(your_data$date_column,'%Y')) <= 2004)`. If your data is just a string or factor, then you can use substrings (as suggested in an answer at the link above), or convert it to POSIX and use the code above. If you need more help, please post a reproducible sample of data using `dput()` so it is copy/pasteable and all structural info is preserved. There's lots of good advice and tips for that [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Gregor Thomas Jul 10 '19 at 17:38
  • Thank you for your help! Can I add a pipe onto that code to assign TRUE and FALSE to test and train datasets? – KanyeNE Jul 10 '19 at 18:39
  • If you just want to add a column no need to split. `your_data %>% mutate(is_train = as.numeric(format(date_column,'%Y')) <= 2004)` – Gregor Thomas Jul 10 '19 at 19:10
  • @Gregor I did add the column to my data. However, I would also need to make two distinct datasets from this dataset as well. `d.train <- my_data[['TRUE']]` is what I initially tried but ran into an error. – KanyeNE Jul 10 '19 at 19:38
  • You've got a lot going wrong here - `'TRUE'` in quotes is just a string. `TRUE` without quotes is a logical value. `[[` is used to get one element of a list or one column of a data frame, so `my_data[['TRUE']]` would try to find a column named `"TRUE"` in your data - which doesn't exist, so there's an error. You might want to look up a basic tutorial for "how to subset a data frame", but something like `d.train <- my_data[my_data$is_train == TRUE, ]` would get you there.... (I leave in `== TRUE` for clarity here; since `is_train` is already either `TRUE` or `FALSE`, you don't really need it.) – Gregor Thomas Jul 10 '19 at 20:08
  • Or you could combine the pieces of code I've given you: `d_list = your_data %>% mutate(is_train = as.numeric(format(date_column,'%Y')) <= 2004) %>% split(.$is_train)`. The list elements are named after the logical values, so `d_list[["TRUE"]]` is the training set and `d_list[["FALSE"]]` is the test set. – Gregor Thomas Jul 10 '19 at 20:09
  • Thank you again for your help @Gregor ! Still working on improving my R proficiency but I think I managed to do what I needed. – KanyeNE Jul 10 '19 at 23:00
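
Pulling the comments above together, here is a minimal base R sketch of the year-based split. The data frame `my_data`, its columns `sample_time` and `value`, and the sample rows are made up for illustration, and the sketch assumes the date-time column is stored as strings in `year-month-day hour:minute:second` form; adjust the `format` argument if your data differs.

```r
# Hypothetical sample data; `sample_time` stands in for the real date-time column
my_data <- data.frame(
  sample_time = c("2003-05-14 08:30:00", "2004-12-31 23:59:00",
                  "2005-01-01 00:00:00", "2006-07-09 14:15:00"),
  value = c(1.2, 3.4, 5.6, 7.8),
  stringsAsFactors = FALSE
)

# Parse the strings as POSIXct (the format string is an assumption about the data)
my_data$sample_time <- as.POSIXct(my_data$sample_time, tz = "UTC",
                                  format = "%Y-%m-%d %H:%M:%S")

# Extract the year as a number
my_data$year <- as.numeric(format(my_data$sample_time, "%Y"))

# Logical flag: TRUE for training rows (year <= 2004), FALSE for test rows
my_data$is_train <- my_data$year <= 2004

# Subset with the logical vector directly
d_train <- my_data[my_data$is_train, ]
d_test  <- my_data[!my_data$is_train, ]

# Equivalent with split(): the list elements are named "FALSE" and "TRUE"
d_list <- split(my_data, my_data$is_train)
```

Here `d_list[["TRUE"]]` contains the same rows as `d_train`, and `d_list[["FALSE"]]` the same rows as `d_test`. If any date-time fails to parse, its year becomes `NA`, so it is worth checking `anyNA(my_data$year)` before subsetting.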
