How to close the gap between summarized predicted data and monthly truth data.

Question

This is a follow-up to an earlier question of mine: R: Summarize inside an nlsLM() statement

With the answer supplied there, I am able to make appropriate minutely predictions when only given monthly data. However, when I average these minutely predictions by month, the averages still don't line up with the monthly truth values. For example, consider the following dataframe:

structure(list(datetime = structure(c(1483246800, 1491022800, 
1514782800, 1514782800, 1491022800, 1512104400, 1506834000, 1519880400, 
1488344400, 1509512400, 1496293200, 1517461200, 1517461200, 1491022800, 
1498885200, 1483246800, 1509512400, 1493614800, 1509512400, 1491022800, 
1498885200, 1514782800, 1514782800, 1496293200, 1498885200, 1496293200, 
1491022800, 1483246800, 1491022800, 1509512400, 1488344400, 1504242000, 
1493614800, 1506834000, 1519880400, 1522558800, 1491022800, 1498885200, 
1485925200, 1512104400, 1493614800, 1498885200, 1498885200, 1522558800, 
1512104400, 1488344400, 1498885200, 1517461200, 1501563600, 1509512400
), class = c("POSIXct", "POSIXt"), tzone = ""), value = c(51682.4946266667, 
40183.82222, 51916.3440866667, 51916.3440866667, 40183.82222, 
52104.55914, 59971.29032, 50324.4333333333, 21255.3763466667, 
63782.1333333333, 50823.3333333333, 59876.3333333333, 59876.3333333333, 
40183.82222, 51522.6666666667, 51682.4946266667, 63782.1333333333, 
46994.1506666667, 63782.1333333333, 40183.82222, 51522.6666666667, 
51916.3440866667, 51916.3440866667, 50823.3333333333, 51522.6666666667, 
50823.3333333333, 40183.82222, 51682.4946266667, 40183.82222, 
63782.1333333333, 21255.3763466667, 22638.7333333333, 46994.1506666667, 
59971.29032, 50324.4333333333, 56303.6666666667, 40183.82222, 
51522.6666666667, 36981.3333333333, 52104.55914, 46994.1506666667, 
51522.6666666667, 51522.6666666667, 56303.6666666667, 52104.55914, 
21255.3763466667, 51522.6666666667, 59876.3333333333, 38082, 
63782.1333333333), mon = structure(c(2017, 2017.25, 2018, 2018, 
2017.25, 2017.91666666667, 2017.75, 2018.16666666667, 2017.16666666667, 
2017.83333333333, 2017.41666666667, 2018.08333333333, 2018.08333333333, 
2017.25, 2017.5, 2017, 2017.83333333333, 2017.33333333333, 2017.83333333333, 
2017.25, 2017.5, 2018, 2018, 2017.41666666667, 2017.5, 2017.41666666667, 
2017.25, 2017, 2017.25, 2017.83333333333, 2017.16666666667,2017.66666666667, 
2017.33333333333, 2017.75, 2018.16666666667, 2018.25, 2017.25, 
2017.5, 2017.08333333333, 2017.91666666667, 2017.33333333333, 
2017.5, 2017.5, 2018.25, 2017.91666666667, 2017.16666666667, 
2017.5, 2018.08333333333, 2017.58333333333, 2017.83333333333), class ="yearmon"), MW = c(6.8193905496418, 2.23569140013909, 7.29307427883793, 
3.89520682166607, 0, 6.83392781148047, 6.79889072887827, 
7.04895914386027, 0, 0, 4.94768283491513, 7.25490198653147, 
5.9299826558944, 2.29357235334133, 0, 0, 0, 2.59727606832981, 
0, 5.7905706950062, 0, 6.47212567650073, 0, 6.11914427796473, 
6.5166676962948, 6.12862205050107, 6.3110616267544, 6.19377239265973, 
5.83618844693073, 7.12230897980393, 0, 0, 0, 3.55167797993943, 
7.45808446388613, 0, 0, 0, 0.95175263904598, 0.924012542790133, 
0, 6.215125001929, 6.79951155611973, 0, 0, 0, 0, 6.6579396958766, 
0, 3.69549937335895)), .Names = c("datetime", "value", "mon", "MW"), row.names = c(5226L, 8762L, 21233L, 21545L, 7433L, 16167L, 
18052L, 277L, 5887L, 20047L, 13166L, 22224L, 22817L, 8102L, 9040L, 
5000L, 19304L, 11169L, 19027L, 7565L, 9053L, 20963L, 20856L, 
12311L, 9980L, 12456L, 7511L, 4889L, 7574L, 20438L, 6088L, 3178L, 
10639L, 18061L, 1273L, 1729L, 8194L, 9145L, 13831L, 16145L, 10618L, 
9558L, 9879L, 1747L, 17121L, 7014L, 9303L, 22813L, 15951L, 19159L), class = "data.frame")

If we run the code from the linked question on this dataframe and then plot the result with the code below (using ggplot2):

library(ggplot2)

# Blue: monthly average of the minutely predictions; red: monthly truth values
p = ggplot(joined, aes(x = datetime))
p = p + labs(x = 'Date', y = 'Predicts',
           title = 'Model vs. Truth Comparison')
p = p + geom_line(aes(y = joined$monthly_avg_prediction), color = 'blue')
p = p + geom_line(aes(y = joined$value), color = 'red')
p

Then we can see that the blue line (our predicted monthly values) is generally lower than the truth values. This is easier to see with the complete dataset, but that would be too much to post as a reproducible example here. If you would like to visualize exactly what happens, here is an example run from a full set of data:

My initial way of combatting this was by using the lm() function in R to adjust the predictions in the correct direction with the following code:

adjust = lm(value~monthly_avg_prediction + 0, data = joined)
linadj = coef(adjust)
joined$monthly_avg_prediction = (joined$monthly_avg_prediction) * linadj[1]
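
To make the adjustment concrete, here is a minimal sketch on made-up numbers rather than my real data (the `toy` data frame and its values are purely illustrative). A least-squares fit through the origin has the closed-form slope sum(x*y)/sum(x^2), so the `lm()` call amounts to one multiplicative scale factor:

```r
# Toy stand-in for `joined`; the numbers are invented for illustration only.
toy <- data.frame(
  value                  = c(50000, 40000, 52000, 60000, 21000),
  monthly_avg_prediction = c(46000, 37500, 47000, 55500, 19800)
)

# Mean gap before adjusting (positive: predictions sit below the truth).
gap_before <- mean(toy$value - toy$monthly_avg_prediction)

# Same kind of fit as above: regression through the origin (no intercept).
adjust <- lm(value ~ monthly_avg_prediction + 0, data = toy)
linadj <- unname(coef(adjust)[1])

# Closed-form slope for a no-intercept least-squares fit.
slope_by_hand <- sum(toy$value * toy$monthly_avg_prediction) /
  sum(toy$monthly_avg_prediction^2)

# Apply the scale factor and measure the remaining mean gap.
toy$adjusted <- toy$monthly_avg_prediction * linadj
gap_after <- mean(toy$value - toy$adjusted)

print(c(lm_slope = linadj, by_hand = slope_by_hand))
print(c(gap_before = gap_before, gap_after = gap_after))
```

Note that a through-origin fit minimizes the squared error but does not force the mean residual to zero, so a small gap can remain after scaling.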

My question is whether there is a better way of doing this that may give more accurate model predictions. This approach definitely gets me close, and does so with sufficiently low p-values, but I want to be sure of its validity.
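
To show the kind of validity check I have in mind, here is a rough sketch on invented numbers (the `toy` data frame and the `rmse` helper are hypothetical and not part of my actual code). It compares the error of the raw predictions, the through-origin scaling, and a fit that also allows an intercept:

```r
# Toy stand-in for `joined`; the numbers are invented for illustration only.
toy <- data.frame(
  value                  = c(50000, 40000, 52000, 60000, 21000),
  monthly_avg_prediction = c(46000, 37500, 47000, 55500, 19800)
)

# Hypothetical helper: root-mean-square error between truth and prediction.
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# Scale-only adjustment (no intercept) versus slope plus intercept.
scale_fit  <- lm(value ~ monthly_avg_prediction + 0, data = toy)
affine_fit <- lm(value ~ monthly_avg_prediction, data = toy)

errors <- c(
  raw    = rmse(toy$value, toy$monthly_avg_prediction),
  scaled = rmse(toy$value, fitted(scale_fit)),
  affine = rmse(toy$value, fitted(affine_fit))
)
print(errors)
```

In-sample, the fit with an intercept can never have a larger RMSE than the scale-only fit, so this comparison alone doesn't show which adjustment generalizes better. Holding out some months and scoring the adjustments on those would probably be a fairer test.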

Thanks in advance for any advice on the matter!

Have you tried running the `ggplot` code without the `joined$` in the `geom_line` calls? — kath, Aug 20 '18 at 14:11
Yes, but that doesn't address my issue. I mostly did that for organizational purposes. — obewanjacobi, Aug 20 '18 at 14:14
OK, that was just the first thing that came to mind. Actually, you shouldn't use this, as it often messes things up, and if the column is already in the data you passed to `ggplot`, then you really don't need to specify it again with `joined$`. — kath, Aug 20 '18 at 14:16
