
I am building a KNN model to predict housing prices. I'll go through my data and my model and then my problem.

Data -

# A tibble: 81,334 x 4
   latitude longitude close_date          close_price
      <dbl>     <dbl> <dttm>                    <dbl>
 1     36.4     -98.7 2014-08-05 06:34:00     147504.
 2     36.6     -97.9 2014-08-12 23:48:00     137401.
 3     36.6     -97.9 2014-08-09 04:00:40     239105.

Model -

library(caret)
library(dplyr)  # provides %>%

# Random 80/20 split based on close_price
training.samples <- data$close_price %>%
  createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]

model <- train(
  close_price~ ., data = train.data, method = "knn",
  trControl = trainControl("cv", number = 10),
  preProcess = c("center", "scale"),
  tuneLength = 10
)

My problem is time leakage. I am making predictions on a house using other houses that closed afterwards and in the real world I shouldn't have access to that information.

I want to apply a rule to the model that says, for each value y, only use houses that closed before the house for that y. I know I could split my test data and my train data on a certain date, but that doesn't quite do it.

Is it possible to prevent this time leakage, either in caret or other libraries for knn (like class and kknn)?

StupidWolf
goollan
  • Can you make a reproducible example? https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – bbiasi May 27 '19 at 16:54

1 Answer


In caret, createTimeSlices implements a variation of cross-validation adapted to time series (avoiding time leakage by rolling the forecasting origin). See the caret documentation on data splitting for time series.

In your case, depending on your precise needs, you could use something like this for a proper cross-validation:

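library(caret)
library(dplyr)

# Sort by closing date so that row order matches time order
your_data <- your_data %>% arrange(close_date)

tr_ctrl <- createTimeSlices(
  your_data$close_price,
  initialWindow = 10,
  horizon = 1,
  fixedWindow = FALSE)

# createTimeSlices returns lists of row indices, not a trainControl object,
# so pass them to trainControl via index / indexOut
model <- train(
  close_price ~ ., data = your_data, method = "knn",
  trControl = trainControl(index = tr_ctrl$train, indexOut = tr_ctrl$test),
  preProcess = c("center", "scale"),
  tuneLength = 10
)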
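
To see what the slices look like, here is a minimal sketch on a toy vector of 15 values (the toy data and the name `slices` are only for illustration, not part of your data). Each training set ends immediately before its test row, so once the rows are sorted by close_date, no later row is used to evaluate an earlier one.

```r
library(caret)

# Toy outcome vector; createTimeSlices only uses its length
y <- seq_len(15)

slices <- createTimeSlices(y, initialWindow = 10, horizon = 1, fixedWindow = FALSE)

# One train/test pair per forecasting origin (rows 10 to 14)
length(slices$train)

# First and last pairs: the training window grows, the test row follows it
slices$train[[1]]; slices$test[[1]]
slices$train[[5]]; slices$test[[5]]

# Check that every training index precedes the test index of the same slice
leak_free <- mapply(function(tr, te) max(tr) < min(te), slices$train, slices$test)
all(leak_free)
```

Note that with horizon = 1 there is one slice per row after the initial window, so on a large data set this means a very large number of resamples.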

EDIT: if you have ties in the dates and want to avoid having deals closed on the same day in both the test and train sets, you can fix tr_ctrl before using it in train:

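library(lubridate)  # for as_date()

# Keep only training rows that closed strictly before the earliest test date
filter_train <- function(i_tr, i_te) {
  d_tr <- as_date(your_data$close_date[i_tr])
  d_te <- as_date(your_data$close_date[i_te])
  tr_is_ok <- d_tr < min(d_te)

  i_tr[tr_is_ok]
}

# SIMPLIFY = FALSE keeps the result a list even if all slices have equal length
tr_ctrl$train <- mapply(filter_train, tr_ctrl$train, tr_ctrl$test, SIMPLIFY = FALSE)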
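
As a rough illustration of what this filter does, here is a small base-R sketch on made-up dates (the `close_date` vector and the helper name `keep_strictly_earlier` are hypothetical and only mirror the rule in filter_train above):

```r
# Hypothetical toy data: five closings, rows 3 and 4 share a date
close_date <- as.Date(c("2014-08-01", "2014-08-02", "2014-08-03",
                        "2014-08-03", "2014-08-04"))

# Same rule as filter_train, written against a plain Date vector
keep_strictly_earlier <- function(i_tr, i_te) {
  i_tr[close_date[i_tr] < min(close_date[i_te])]
}

# Slice that predicts row 4 from rows 1 to 3:
# row 3 is dropped because it closed on the same day as row 4
keep_strictly_earlier(1:3, 4)

# Slice that predicts row 5 from rows 1 to 4: nothing is dropped
keep_strictly_earlier(1:4, 5)
```

With many same-day closings, a slice's training set can shrink noticeably, or even end up empty for the earliest slices, so a larger initialWindow may be needed.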
Pierre Gramme
  • When you predict the `i`th close price, in this example, doesn't the model still have access to homes that closed after that one? – goollan May 27 '19 at 17:10
  • No, only to the $i-1$ records before. See the picture in the link I included – Pierre Gramme May 27 '19 at 17:15
  • I have multiple homes that close on the same day. When you set horizon = 1, does that mean the time slice will go until the next home or the next close date? I'm thinking I might want to set the first argument to my_data$close_date. – goollan May 27 '19 at 17:23
  • If you check with `View(caret::createTimeSlices)`, you will see that the first argument is only used for its length, so your suggestion won't work. Sorry but I don't have a direct answer for when there are ties in the timepoints – Pierre Gramme May 28 '19 at 07:25