2

I'm using nlsLM() to make a model of a power function, but I need to summarize my data within the function call to find the appropriate coefficient and exponent. More specifically, here is what my model code looks like:

Jmod = nlsLM(value~(a)*summarise(funs(mean), (MW)^b),
                 start = list(a=100000, b = 1/3), data = mod_data,
                 upper = c(Inf,1), lower = c(0,1/5))

where MW is the data I am trying to use to predict value. The data for MW is already grouped by month based off a variable called datetime, so I would like to take the monthly average of MW^b where b is found by the nlsLM() statement.

I would take the average beforehand, but as you may realize, this is not mathematically equivalent [ie. ((a+c)/2)^b is not equal to (a^b + c^b)/2].

If anyone has any info on how to do this, I would be greatly appreciative!


EDIT:

Below is the code to make a sample data set of what I'm trying to work with:

structure(list(datetime = structure(c(1514782800, 1480568400,1504242000, 1509512400, 1509512400, 1485925200, 1517461200, 1485925200, 1501563600, 1467349200, 1472706000, 1454302800, 1483246800, 1498885200, 1506834000, 1477976400, 1483246800, 1477976400, 1509512400, 1496293200, 1451624400, 1454302800, 1454302800, 1464757200, 1498885200, 1517461200, 1462078800, 1506834000, 1522558800, 1483246800, 1501563600, 1451624400, 1485925200, 1501563600, 1451624400, 1517461200, 1475298000, 1480568400, 1512104400, 1456808400, 1477976400, 1475298000, 1517461200, 1459486800, 1501563600, 1477976400, 1506834000, 1506834000, 1451624400, 1483246800), class = c("POSIXct", "POSIXt"), tzone = ""), value = c(2863.27837518519, 2878.40382333333, 1236.74444444444, 3522.48888888889, 3522.48888888889, 2033.55555555556, 3305.5, 2033.55555555556, 2094.7037037037, 3052.91875740741, 2960.52222222222, 1733.7918262963, 2850.28673851852, 2841.40740740741, 3310.77538814815, 2266.26172851852, 2850.28673851852, 2266.26172851852, 3522.48888888889, 2802.55555555556, 2196.82556740741, 1733.7918262963, 1733.7918262963, 3001.43703703704, 2841.40740740741, 3305.5, 2061.4826762963, 3310.77538814815, 3107.01851851852, 2850.28673851852, 2094.7037037037, 2196.82556740741, 2033.55555555556, 2094.7037037037, 2196.82556740741, 3305.5, 2848.90322592593, 2878.40382333333, 2873.73476703704, 2208.64755074074, 2266.2172851852, 2848.90322592593, 3305.5, 2021.68765444444, 2094.7037037037, 2266.26172851852, 3310.77538814815, 3310.77538814815, 2196.82556740741, 2850.28673851852), mon = structure(c(2018, 2016.91666666667, 2017.66666666667, 2017.83333333333, 2017.83333333333, 2017.08333333333, 2018.08333333333, 2017.08333333333, 2017.58333333333, 2016.5, 2016.66666666667, 2016.08333333333, 2017, 2017.5, 2017.75, 2016.83333333333, 2017, 2016.83333333333, 2017.83333333333, 2017.41666666667, 2016, 2016.08333333333, 2016.08333333333, 2016.41666666667, 2017.5, 2018.08333333333, 2016.33333333333, 2017.75, 2018.25, 2017, 2017.58333333333, 2016, 2017.08333333333, 2017.58333333333, 2016, 2018.08333333333, 2016.75, 2016.91666666667, 2017.91666666667, 2016.16666666667, 2016.83333333333, 2016.75, 2018.08333333333, 2016.25,2017.58333333333,2016.83333333333, 2017.75, 2017.75, 2016, 2017), class = "yearmon"), 
MW = c(2.6142700774997, 4.02670249993547, 0.666666666666667, 
0.724114015571947, 4.07197668868287, 3.74122386862433, 3.30097429092907, 
3.84858110028323, 0.666666666666667, 0.666666666666667, 4.35000446878457, 
0.666666666666667, 0.666666666666667, 3.8371824280444, 0.825077317374, 
0.666666666666667, 4.028058457579, 0.666666666666667, 4.3378032532779, 
3.84270845997837, 1.40955788986009, 0.666666666666667, 0.666666666666667, 
4.05845600900597, 4.00664052392117, 4.0295346724872, 0.666666666666667, 
4.14159923664523, 4.231951299842, 3.9562222817766, 0.666666666666667, 
3.61602795165213, 0.666666666666667, 3.58079262746603, 4.12197770915903, 
4.2610646492437, 4.02152528469467, 1.0117763092792, 2.03648922832252, 
0.666666666666667, 0.666666666666667, 3.8042476910097, 3.91787334748133, 
0.666666666666667, 0.666666666666667, 0.89571472289964, 4.1530002677697, 
3.93733212731873, 0.710671314318797, 0.666666666666667)), .Names = c("datetime", "value", "mon", "MW"), row.names = c(39113L, 12946L, 4365L, 37505L, 36601L, 31055L, 39814L, 31433L, 32105L, 20668L, 18191L, 8328L, 10232L, 25689L, 35528L, 4577L, 10302L, 5146L, 37975L, 29670L, 28429L, 7932L, 8468L, 23120L, 25111L, 39699L, 24312L, 36246L, 1556L, 11068L, 33269L, 29163L, 31685L, 32419L, 29059L, 40618L, 16751L, 11737L, 34371L, 6001L, 4864L, 16413L, 40304L, 8716L, 33190L, 5399L, 35610L, 36462L, 28338L, 10371L), class = "data.frame")

This will create the mod_data that I am using to make the model. And just to reiterate, what I have done here is group data by month, found in the mon column of my data, and now I want to summarize the data by month, but with the exponent included in the mean, as seen in my above code. Thanks again!

obewanjacobi
  • 458
  • 2
  • 12
  • 1
    Can you reduce your problem to a [reprodubile example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and include mock data so that others can better understand your issue? – Curt F. Aug 08 '18 at 20:19
  • I have edited in hopes to address your concerns @CurtF. Please let me know if there's anything else I can do to move this along. – obewanjacobi Aug 09 '18 at 17:21
  • I should also say that I am not required to use the summarize function here, or dplyr at all for that matter. I am only using it because the `group_by()` function has already grouped my data based off of the `mon` variable in the dataframe. – obewanjacobi Aug 13 '18 at 20:25
  • Which is the package of the `nlsLM()` function? – s__ Aug 14 '18 at 11:47
  • @s_t it is in the `minpack.lm` package – obewanjacobi Aug 14 '18 at 12:45

1 Answers1

3

What you want isn't clear from the question. Do you want to fit a separate model for every unique month in your data? Or do you want to fit one model for all of the data and then take monthly averages of the value of MW^b ?

Here's one way on how to do the latter case.

require(minpack.lm)
require(tidyverse)
require(broom)

dat <- structure(...) # provided in the question

predictions <- 
    dat %>% 
    ungroup %>%
    mutate(row = row_number()) %>%
    do(augment(nlsLM(
                formula = value ~ a * MW^b + 0*row, 
                data = .,
                start = list(a = 100000, b=1/3),
                upper = c(Inf, 1), 
                lower = c(0, 1/5)
               )
           )
       )

joined <- 
    dat %>%
    mutate(row = row_number()) %>%
    left_join(predictions, by=c('MW', 'value', 'row')) %>%
    select(-row)

joined %>%
    group_by(mon) %>%
    mutate(monthly_avg_prediction = mean(.fitted))

Notes:

  1. This kind of stuff is much easier with the broom package. This is because broom converts the results of model-finding functions like lm, nls, or nlsLM etc. into dataframes. So you don't have to memorize or re-lookup the function-specific structure of the model object (e.g. model$params[['estimate']][[1]]) or similar stuff; the model results are already in R-standard dataframe format.
  2. My solution uses ideas from this StackOverflow answer on how to join broom-generated dataframes of predictions with original data. That's why the stuff with row_number() and left_join() are in there. Otherwise, in the general case, augment will throw away data from the original dataframe that is not used in the model prediction, and it will not work well if there are repeated values in the data that is used.
  3. The .fitted column is generated by broom's augment function. It is the model prediction at the indicated datapoint.
  4. The result (that I think you might want) is in the monthly_avg_prediction column of the joined dataframe. But that represents a single, global model, fit on all the data, and predictions from that model are averaged by month.
Curt F.
  • 4,690
  • 2
  • 22
  • 39
  • Thanks for the answer! My goal here is to fit one model for all of the data and then take monthly averages, and the catch is that I only have monthly averages as data to create my model. Does that make sense? If I understand correctly, this is what you answered, but I want to make sure before jumping to conclusions. – obewanjacobi Aug 14 '18 at 18:00
  • 1
    Yep, that's what I did. – Curt F. Aug 14 '18 at 18:03
  • 1
    Also, if you want to see the model results, change the `do(augument(` to `do(tidy(` for parameter estimates or `do(glance(` for global quality-of-fit information. – Curt F. Aug 14 '18 at 18:08
  • Now the problem is that I need coefficients that line up with the `monthly_avg_prediction` column that you find here in your code, but the coefficients for `c` and `b` that I get by using the `do(tidy(` command are in reference to before the averages happen. Does this make sense? Thanks in advance! – obewanjacobi Aug 14 '18 at 19:25
  • Well if you are only fitting one model, there is only one `b` coefficient and one `c` (or is it `a`) coefficient for the whole data. – Curt F. Aug 14 '18 at 20:31
  • correct, but the (let's call it `a`, sorry about that) `a` value and `b` value I get from doing it this way work for only the monthly averages. I need a model that works for the data on a minutely basis as well. The model needs to be accurate by the minute, using only the monthly averages to make it. This is why I need the model to take the averages of the MW^b values before making the model. Does this make sense? If not, we could move this message to a chat to try and get this figured out. I will be available most of the day, just let me know. – obewanjacobi Aug 15 '18 at 12:39
  • 1
    I still have no idea what you are trying to do. I don't think you have supplied a reproducible example. If you have a new question you should probably ask a new question. – Curt F. Aug 15 '18 at 22:06
  • Disregard my comment from yesterday, the averages are being taken at the correct time. It was my own error that I was encountering. However, if you were to plot the predicted monthly values (what we're calling `monthly_avg_prediction`) and the truth data (labelled value above) you will notice that the monthly predictions are consistently lower than the truth values. If I understand correctly, this comes from the fact that we are making predictions to the monthly values and not the minutely values that are given. Is there a way to combat this? – obewanjacobi Aug 17 '18 at 13:59