I want to use the difference between two dates as the dependent variable in my regression model. Each date is stored across three separate columns (one each for year, month, and day), and all of those columns are numeric. My attempt so far tidies the data by removing every row that contains an NA and then tries to convert each column to a date:
```{r}
movies2 <- na.omit(movies)
theater_year <- as_date(movies2$thtr_rel_year)
theater_month <- as_date(movies2$thtr_rel_month)
theater_day <- as_date(movies2$thtr_rel_day)
dvd_year <- as_date(movies2$dvd_rel_year)
dvd_month <- as_date(movies2$dvd_rel_month)
dvd_day <- as_date(movies2$dvd_rel_day)
```
Then I try to create a new column in my data set that holds the difference between the two dates:
```{r}
moviesclean <- within(movies2, {datediff <- c(dvd_year, dvd_month, dvd_day) - c(theater_year, theater_month, theater_day)})
```
This generates the message `replacement element 1 has 1857 rows to replace 619 rows`.
When I set up and run my regression model:
```{r}
model1 <- lm(datediff ~ genre + title_type + critics_score + imdb_rating + best_pic_nom + best_pic_win, data = movies2)
```
I receive the following error: `Error in model.frame.default(formula = datediff ~ genre + title_type +: variable lengths differ (found for 'genre')`.
It appears the new column ends up three times too long (1857 = 3 × 619) because `c()` concatenates the three columns end to end instead of combining them into one date per row. How do I avoid this?
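
For reference, here is a minimal sketch of what I think the intended result looks like: one date per row built from the year, month, and day columns, and one difference per row. It uses base R's `ISOdate()` and `as.Date()`. The small data frame is made-up toy data standing in for `movies2`; only the column names come from my real data set.

```{r}
# Toy stand-in for movies2: column names match the question, values are invented.
movies2 <- data.frame(
  thtr_rel_year  = c(2013, 2001),
  thtr_rel_month = c(4, 12),
  thtr_rel_day   = c(19, 25),
  dvd_rel_year   = c(2013, 2002),
  dvd_rel_month  = c(7, 5),
  dvd_rel_day    = c(30, 14)
)

# ISOdate() builds one date-time per row from numeric year/month/day vectors;
# as.Date() then drops the time-of-day part.
theater_date <- as.Date(ISOdate(movies2$thtr_rel_year,
                                movies2$thtr_rel_month,
                                movies2$thtr_rel_day))
dvd_date <- as.Date(ISOdate(movies2$dvd_rel_year,
                            movies2$dvd_rel_month,
                            movies2$dvd_rel_day))

# Subtracting two Date vectors gives one difference per row, in days.
# as.numeric() turns it into a plain number that lm() can use as a response.
movies2$datediff <- as.numeric(dvd_date - theater_date)
```

With this shape, `datediff` has exactly as many elements as the data frame has rows, so it can sit alongside `genre` and the other predictors, provided the data frame passed to `lm()` via `data =` is the one that contains `datediff`. Since I already load lubridate, I assume `make_date(year, month, day)` would do the same job as the `ISOdate()` step.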