How do you prepare cyclic ordinal features, such as time of day or day of the week, for the random forest algorithm?

If time is encoded simply as minutes after midnight, 23:55 and 00:05 end up very far apart numerically (1435 vs. 5), even though they are only 10 minutes apart.

I found a solution here in which the time feature is split into two features: the sine and the cosine of the seconds-after-midnight value. But is that appropriate for a random forest? With a random forest, one can't be sure that all features will be available at every split, so a decision will often be made with half of the time information missing.

Looking forward to your thoughts!

Max
    Exactly the same problem here! Unfortunately, no one has answered this very important question! – ZelelB Nov 25 '18 at 14:04

1 Answer

If you have a date variable with values like '2019/11/09', you can extract individual features such as year (2019), month (11), day (9), day of the week (Saturday), quarter (4), and semester (2). If you know the dates of specific events, you can also add features like "is bank holiday", "is weekend", or "advertisement campaign".

If you have a time variable with values like 23:55, you can extract the hour (23), the minutes (55) and, if they are recorded, seconds, nanoseconds, etc. If the timezone is known, you can extract that as well.

If you have a datetime variable with values like '2019/11/09 23:55', you can combine the above.
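As a minimal sketch of the extraction described above, assuming the data sits in a pandas DataFrame with a hypothetical column named "timestamp" (both the column name and the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data; the column name "timestamp" is an assumption.
df = pd.DataFrame(
    {"timestamp": pd.to_datetime(["2019-11-09 23:55", "2019-11-10 00:05"])}
)

ts = df["timestamp"].dt  # pandas datetime accessor

# Date parts
df["year"] = ts.year
df["month"] = ts.month
df["day"] = ts.day
df["day_of_week"] = ts.dayofweek  # Monday=0 ... Sunday=6
df["quarter"] = ts.quarter
df["semester"] = (ts.quarter + 1) // 2  # quarters 1-2 -> 1, quarters 3-4 -> 2

# Time parts
df["hour"] = ts.hour
df["minute"] = ts.minute

# A simple event-style flag derived from the date
df["is_weekend"] = ts.dayofweek >= 5

print(df)
```

Flags such as "is bank holiday" would need an external list of dates, so they are left out here.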

If you have more than one datetime variable, you can capture the differences between them. For example, if you have a date of birth and a date of application, you can derive the feature "age at time of application".
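One possible way to derive such a difference feature with pandas is sketched below. The column names "date_of_birth" and "application_date" and the sample rows are hypothetical, and age is counted here in completed years, which is only one of several reasonable definitions:

```python
import pandas as pd

# Hypothetical sample data; column names and values are assumptions.
df = pd.DataFrame(
    {
        "date_of_birth": pd.to_datetime(["1990-06-15", "1985-12-01"]),
        "application_date": pd.to_datetime(["2019-11-09", "2019-11-09"]),
    }
)

dob = df["date_of_birth"].dt
app = df["application_date"].dt

# True if the birthday has already occurred in the application year
had_birthday = (app.month > dob.month) | (
    (app.month == dob.month) & (app.day >= dob.day)
)

# Completed years: year difference, minus one if the birthday is still to come
df["age_at_application"] = app.year - dob.year - (~had_birthday).astype(int)

print(df)
```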

More info about the available datetime options can be found in the documentation of the pandas dt accessor (Series.dt). Check methods here.

The cyclical transformation in your link is used to re-code circular variables like hours of the day or months of the year, where, for example, December (month 12) is closer to January (month 1) than to July (month 7). If you encode the months as plain numbers, this relationship is not captured. You would use this transformation if that closeness is what you want to represent. But, to my knowledge, it is not the standard go-to method for transforming these variables.
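To make the idea concrete, here is a minimal sketch of the sine/cosine encoding with NumPy, applied to minutes after midnight. The helper name encode_cyclical is hypothetical, not part of any library:

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60


def encode_cyclical(value, period):
    """Map a cyclic value onto the unit circle as a (sin, cos) pair."""
    angle = 2 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)


# 23:55, 00:05 and 12:00 as minutes after midnight
minutes = np.array([23 * 60 + 55, 5, 12 * 60])
sin_t, cos_t = encode_cyclical(minutes, MINUTES_PER_DAY)
points = np.column_stack([sin_t, cos_t])

# Gap on the raw scale between 23:55 and 00:05
raw_gap = abs(minutes[0] - minutes[1])

# Euclidean distances in the encoded (sin, cos) space
dist_close = np.linalg.norm(points[0] - points[1])  # 23:55 vs 00:05
dist_far = np.linalg.norm(points[0] - points[2])    # 23:55 vs 12:00

print(raw_gap, dist_close, dist_far)
```

On the raw scale 23:55 and 00:05 are 1430 minutes apart, while in the encoded space they sit next to each other on the circle and 12:00 lies almost on the opposite side.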

You can check Scikit-learn's tutorial on time related feature engineering.

Random forests capture non-linear relationships between features and targets, so they should be able to handle both plain numerical features like month and their cyclical encoding.

To be absolutely sure, the best way is to try both feature engineering methods and see which one gives better model performance.
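A rough sketch of such a comparison with scikit-learn follows. The data is synthetic and purely illustrative (an hour-of-day feature with a made-up daily cycle in the target), so the scores say nothing about real datasets. The helper mean_cv_r2 is a hypothetical name:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: an assumption for illustration only.
rng = np.random.default_rng(0)
hour = rng.uniform(0, 24, size=500)
y = np.cos(2 * np.pi * hour / 24) + rng.normal(scale=0.1, size=hour.size)

# Option 1: the raw numerical feature
X_raw = hour.reshape(-1, 1)

# Option 2: the sine/cosine encoding of the same feature
angle = 2 * np.pi * hour / 24
X_cyc = np.column_stack([np.sin(angle), np.cos(angle)])


def mean_cv_r2(X, y):
    """Mean 5-fold cross-validated R^2 of a small random forest."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()


scores = {"raw_hour": mean_cv_r2(X_raw, y), "sin_cos": mean_cv_r2(X_cyc, y)}
print(scores)
```

On your own data, replace the synthetic arrays with your features and target, and use a metric and cross-validation scheme that suit the problem (for time series, a time-aware split is usually preferable).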

You can apply the cyclical transformation straight away with the open-source package Feature-engine. Check the CyclicalTransformer.

Sole Galli