
I am trying to encode cyclical features for an ML algorithm in which the timestamp is a very important feature.

I want to transform the day of the month (the 'day' column of cyclic_df) into a cyclical variable, so that the 1st of a month comes right after the last day of the previous month. For example, 1 February (01.02) should be next to 31 January (31.01), so the difference between the two days, if you consider just the day column, is 1 and not 30.

# Transform the cyclical features 
cyclic_df['min_sin'] = np.sin(cyclic_df.minute*(2.*np.pi/59))       # Sine component of minute
cyclic_df['min_cos'] = np.cos(cyclic_df.minute*(2.*np.pi/59))       # Cosine component of minute
cyclic_df['hr_sin'] = np.sin(cyclic_df.hour*(2.*np.pi/23))          # Sine component of hour
cyclic_df['hr_cos'] = np.cos(cyclic_df.hour*(2.*np.pi/23))          # Cosine component of hour

cyclic_df['d_sin'] = np.sin(cyclic_df.day*(2.*np.pi/30))            # !!!Sine component of day!!! Help here
cyclic_df['d_cos'] = np.cos(cyclic_df.day*(2.*np.pi/30))            # !!!Cosine component of day!!! Help here

cyclic_df['mnth_sin'] = np.sin((cyclic_df.month-1)*(2.*np.pi/12))   # Sine component of month
cyclic_df['mnth_cos'] = np.cos((cyclic_df.month-1)*(2.*np.pi/12))   # Cosine component of month

The problem is the 30 that I divide by. Not every month has 30 days: a month can have 28, 29, 30 or 31 days. Each row of cyclic_df has a 'month' column, a 'year' column and a 'day' column, so theoretically it should be possible to look up the right number of days for the month in that row. How can I replace that 30 (in the d_sin and d_cos lines of the code above) with an expression that reads the year and month from the other columns and uses the correct number of days, instead of always 30?

PS: It would be very nice if someone could tell me whether I am doing it right for the minute, hour and month, which are also in the code above.

EDIT (after comments): Yes, I have a 'year' column. After changing the two lines to:

cyclic_ext_df['d_cos'] = np.cos(cyclic_ext_df.day*(2.*np.pi/monthrange(cyclic_df.year, cyclic_ext_df.month)[1]))
cyclic_ext_df['d_cos'] = np.cos(cyclic_ext_df.day*(2.*np.pi/monthrange(cyclic_df.year, cyclic_ext_df.month)[1]))

I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-575-532a308075e2> in <module>()
     11 #cyclic_ext_df['d_cos'] = np.cos(cyclic_ext_df.day*(2.*np.pi/30))            # Cosinus component of day
     12 
---> 13 cyclic_ext_df['d_cos'] = np.cos(cyclic_ext_df.day*(2.*np.pi/monthrange(cyclic_df.year, cyclic_ext_df.month)[1]))
     14 cyclic_ext_df['d_cos'] = np.cos(cyclic_ext_df.day*(2.*np.pi/monthrange(cyclic_df.year, cyclic_ext_df.month)[1]))
     15 

~/anaconda/lib/python3.6/calendar.py in monthrange(year, month)
    120     """Return weekday (0-6 ~ Mon-Sun) and number of days (28-31) for
    121        year, month."""
--> 122     if not 1 <= month <= 12:
    123         raise IllegalMonthError(month)
    124     day1 = weekday(year, month, 1)

~/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in __nonzero__(self)
   1574         raise ValueError("The truth value of a {0} is ambiguous. "
   1575                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
-> 1576                          .format(self.__class__.__name__))
   1577 
   1578     __bool__ = __nonzero__

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
ZelelB

2 Answers

1

If you have the year and month in your data, you can use calendar.monthrange:

from calendar import monthrange

month = 2
year = 2014

_, mr = monthrange(year, month)
cyclic_df['d_cos'] = np.cos(cyclic_df.day*(2.*np.pi/mr))
  • yes, but how to apply it to all values of the dataframe column, as I am doing with the lines above... With one line, I am applying that to all values of the column. But now it's 30, and I want it dynamic, depending on the month and year in that row. – ZelelB Dec 06 '18 at 20:56
  • It is hard to figure out without seeing the actual dataframe. But are year values available by lookup, like `cyclic_df.month`? Something like: `cyclic_df['d_cos'] = np.cos(cyclic_df.day*(2.*np.pi/monthrange(cyclic_df.year, cyclic_df.month)[1]))` – Alexey Bogomolov Dec 06 '18 at 22:39
  • yes, I have a year column in the dataframe. cyclic_df.year But getting an error when changing the code to your proposition. Edited the question with the code & error. – ZelelB Dec 07 '18 at 11:41
  • Then your month number is in a different format than the one supported by the monthrange function. Check what is returned when you evaluate `cyclic_ext_df.month`. You will probably have to convert it to an integer to make it work. The same goes for the year number; it has to be an integer too. – Alexey Bogomolov Dec 07 '18 at 20:41
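
The traceback in the question stops at `if not 1 <= month <= 12` inside `monthrange`, which suggests the function received a whole Series rather than a single integer. Below is a minimal sketch of a per-row alternative, assuming the 'year', 'month' and 'day' columns hold integers. The sample values are made up for illustration, and the `day - 1` shift is an assumption that mirrors the `month - 1` already used in the question, not something this thread confirms as the required convention.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows using the column names from the question.
cyclic_df = pd.DataFrame({
    "year":  [2018, 2018, 2016, 2018],
    "month": [1, 2, 2, 4],
    "day":   [31, 1, 29, 30],
})

# Build one timestamp per row (day fixed to 1) and read the month length
# from it, so no Python-level loop over monthrange() is needed.
month_start = pd.to_datetime(cyclic_df[["year", "month"]].assign(day=1))
days_in_month = month_start.dt.days_in_month

# Assumption: shift the day to start at 0 so that one full turn of the
# circle corresponds to exactly days_in_month steps.
angle = 2.0 * np.pi * (cyclic_df["day"] - 1) / days_in_month

cyclic_df["d_sin"] = np.sin(angle)  # sine component of day
cyclic_df["d_cos"] = np.cos(angle)  # cosine component of day
```

By the same reasoning, the divisor is the length of the cycle: minutes run from 0 to 59, so the cycle has 60 steps, and hours run from 0 to 23, so it has 24 steps.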
1

I don't really understand what you're doing with trigonometry - either you're not explaining your goal well, or you are over-engineering the solution.

The year/month/day convention is a human convenience. For straightforward comparisons of days, time is measured as the number of time units since an agreed-upon epoch. The most common case of this is the Unix timestamp, which counts seconds since Jan 1, 1970.

You therefore have two options:

  • You can convert all times to Unix timestamps, then convert them from seconds to days.
    • Converting a date to a timestamp is explained here. That question assumes parsing a string, but you can also instantiate datetime with actual date values.
    • If s is seconds, you can get the number of days with d = s/(24*60*60)
  • You can switch to your own day-based system.
    • After setting an arbitrary "epoch date", you can get the number of days between the epoch and any date in your table as described here.
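
A minimal sketch of the day-count idea, assuming integer 'year', 'month' and 'day' columns as in the question. The sample rows and the `days_since_epoch` column name are made up for illustration. Note that this produces a linear day count, not a cyclical encoding.

```python
import pandas as pd

# Hypothetical sample rows using the column names from the question.
cyclic_df = pd.DataFrame({
    "year":  [2018, 2018],
    "month": [1, 2],
    "day":   [31, 1],
})

# Assemble real dates from the three integer columns.
dates = pd.to_datetime(cyclic_df[["year", "month", "day"]])

# Count whole days since the chosen epoch (here the Unix epoch).
epoch = pd.Timestamp("1970-01-01")
cyclic_df["days_since_epoch"] = (dates - epoch).dt.days

# Difference in days between 31 January and 1 February.
day_gap = cyclic_df["days_since_epoch"].diff().iloc[1]
print(day_gap)  # → 1.0
```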
Wassinger