I need to use multivariate linear regression for my project, where I have two dependent variables: mean_1 and mean_2. The independent variable is a date in YYYY-mm-dd format. I have been going through various Stack Overflow posts to understand how to use a date as a variable in regression. Some suggest converting the date to a numerical value (https://stackoverflow.com/a/40217971/13713750), while the other option is to convert the date to dummy variables.

What I don't understand is how to convert every date in the dataset to a dummy variable and use it as an independent variable. Is that even possible, or are there better ways to use a date as an independent variable?

Note: I would prefer to keep the date in date format so that it is easy to plot and analyse the results of the regression. I am working with pyspark, but I can switch to pandas if necessary, so any example implementations would be helpful. Thanks!

Samiksha

2 Answers

You could create new columns year, month, day_of_year, day_of_month, day_of_week. You could also add some binary columns like is_weekday, is_holiday. In some cases it is beneficial to add third-party data, such as daily weather statistics (I was working on a case where extra daily weather data proved very useful). It really depends on the domain you're working in. Any of those columns could unveil some pattern behind your data.
As for dummy variables, converting month and day_of_week to dummies makes sense.
Another option is to build a model for each month.
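
The columns listed above can be derived in pandas roughly as follows. This is only a minimal sketch: the sample rows are made up for illustration, and is_holiday is left out because it would need an external calendar.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-09", "2021-02-14", "2021-03-01"],
    "mean_1": [1.0, 1.5, 2.1, 2.4],
    "mean_2": [10.0, 9.2, 8.9, 8.1],
})
df["date"] = pd.to_datetime(df["date"])  # parse YYYY-mm-dd strings

# Calendar features read off the datetime column.
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_year"] = df["date"].dt.dayofyear
df["day_of_month"] = df["date"].dt.day
df["day_of_week"] = df["date"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekday"] = df["day_of_week"] < 5

# One dummy column per month / weekday; drop_first avoids perfect
# collinearity with the intercept in a linear model.
features = pd.get_dummies(df, columns=["month", "day_of_week"], drop_first=True)
print(features.columns.tolist())
```

The original date column is still present in features, so it remains available for plotting.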

Mark
  • thank you for looking into it. Unfortunately I cannot categorize the dates to is_weekday or is_holiday as it will not be helpful in my case. – Samiksha Feb 10 '21 at 19:34

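If you want to transform a date to a numeric value (which I don't recommend), you can convert it to whole seconds since the Unix epoch (1970-01-01):

(pd.to_datetime(df['date']) - pd.Timestamp('1970-01-01')).dt.total_seconds().astype('int64')

Dropping the cast gives the same number of seconds as a float:

(pd.to_datetime(df['date']) - pd.Timestamp('1970-01-01')).dt.total_seconds()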

Alternatively, you can pick a baseline date and subtract it from your date variable to obtain a number of days. This gives you an integer that is easy to interpret (a bigger difference means a later date, a smaller difference an earlier one). To me, this value makes sense as an independent variable in a model.

First, we create a baseline date (it can be whatever you want) and store it in the column static (this needs import datetime):

df['static'] = pd.to_datetime(datetime.date(2017, 6, 28))

Then we subtract the static date from your date (converted to datetime first) to get the difference in days:

df['days'] = (pd.to_datetime(df['date']) - df['static']).dt.days

And there you have a number ready to be used as an independent variable.
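
To show how the day count could feed a regression with both dependent variables while the date column stays available for plots, here is a minimal sketch. The data values are invented, the baseline is assumed to be the earliest date, and plain NumPy least squares stands in for whatever regression library you end up using.

```python
import numpy as np
import pandas as pd

# Hypothetical data: both targets are exact linear functions of the day count.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03",
                            "2021-01-04", "2021-01-05"]),
    "mean_1": [1.0, 3.0, 5.0, 7.0, 9.0],    # 1 + 2.0 * days
    "mean_2": [10.0, 9.5, 9.0, 8.5, 8.0],   # 10 - 0.5 * days
})

# Days since the baseline (assumption: earliest date in the data).
baseline = df["date"].min()
df["days"] = (df["date"] - baseline).dt.days

# Design matrix with an intercept column; both targets are fitted at once.
X = np.column_stack([np.ones(len(df)), df["days"].to_numpy(dtype=float)])
Y = df[["mean_1", "mean_2"]].to_numpy(dtype=float)
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)  # row 0: intercepts, row 1: slopes

# Store fitted values next to the untouched date column.
fitted = X @ coef
df["mean_1_fit"] = fitted[:, 0]
df["mean_2_fit"] = fitted[:, 1]
print(coef)
```

Because the date column is never overwritten, a scatter plot can still use df["date"] on the x-axis with the observed and fitted values on the y-axis. Ordinary least squares with two targets gives the same coefficients as fitting each target separately.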

DanCor
  • thank you for looking into it. Your idea suggests that I use the number of days as the independent variable, but I was hoping for a solution where I could still use the date as a date (not days) for further analysis, for example scatter plots. – Samiksha Feb 11 '21 at 08:47
  • just for my understanding, are you looking to predict the date in your linear regression? – DanCor Feb 11 '21 at 14:18
  • no, date is the independent variable in my case. I need to analyse if there is an effect/ relationship of date on mean_1 and mean_2 using regression. – Samiksha Feb 11 '21 at 15:37
  • in that case, wouldn't a time series be much better than using a linear regression? because then you can use the dates without changing them – DanCor Feb 11 '21 at 19:59
  • Can you share some examples? because I have been trying to find some time series regression models where I can use multiple dependent variables but found models for multiple independent variables. – Samiksha Feb 12 '21 at 11:31
  • is it not possible to use 1 model for each dependent variable? in my understanding, time series models usually have only 1 target variable, so if you want to predict more than one, you would use a model for each – DanCor Feb 12 '21 at 13:08
  • I guess I will have to look for each dependent variable separately. thanks :) – Samiksha Feb 14 '21 at 20:47