-1

We have external data with daily values (pulled in that format to DB) that need to be added up to approximate monthly values, that align with another external dataset showing approximate monthly values (date range of YYYY-MM-DD in separate columns in that set). The new data forms basis for linear regression.

We would like to use R to:

  1. Add the daily date values in data1 in the range that meets monthly values of date range in data2. In other words, if "start" in Data2 = 2015-02-14 and "end" in Data2 = 2015-03-15, we want to know the total of daily values of a variable in columnX of data1 for the range 2015-02-14 to 2015-03-15.

can't figure out the logic function to automatically define parameters in Data1 from info in Data2. Is this equation close?

monthly=sum(data1$variable, if(data1$Date > Data2$StartDate &  data1$Date < Data2$endDate))

data1$variable is in rows with data1$Date.

The error on the equation above = Error: unexpected ')'

We can not figure out how to build this argument. Any assistance would be appreciated!

AESH
  • 1
  • 1
  • It is very difficult to help you, given the information provided. I'd suggest reading http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610 first and to revise your question. This will make it much more likely that you will get a helpful response. – coffeinjunky Mar 19 '15 at 16:26
  • dates are just formatted numeric values, so sum up the values where the `as.numeric(date)` is between x and y. another approach would be to `?cut` the dates to make factor bins based on month and sum for each bin – rawr Mar 19 '15 at 16:33
  • coffeeinjunky - thanks for the suggestion. Unfortunately getting to the point of generating errors is not even possible - very very new to R. Taking some courses, and can follow the concepts but what we want to do now is perhaps too far advanced for current skill level (this could have been completed in spreadsheet hours ago...) Want to learn, but maybe need more basics before we can proceed. Will try to reproduce the data and rephrase question. – AESH Mar 19 '15 at 17:04

1 Answers1

0

Assume you have data1 and data2 as below

daysInData1 <- seq(as.Date('2013-03-1'), as.Date('2014-12-07'), by = 'day')
data1 <- data.frame(Date = daysInData1, variable = runif(length(daysInData1))) 

daysInData2 <- seq(as.Date('2013-03-15'), as.Date('2015-03-14'), by = 'month')
data2 <- data.frame(StartDate = daysInData2, volume = seq(length(daysInData2)), )

They should look like

>data1
    Date       variable
1   2013-03-01 0.944390132092
2   2013-03-02 0.168255153345
3   2013-03-03 0.919271149905
4   2013-03-04 0.456344844541
5   2013-03-05 0.365338093136
6   2013-03-06 0.158996492159
(...omit the rest)

>data2
   StartDate volume
1  2013-03-15      1
2  2013-04-15      2
3  2013-05-15      3
(...omit the rest)

You can use for loop to categorize data1 based on data2 date range

for( i in 1:nrow(data2))
{
    data1[data1$Date >= data2[(i), 'StartDate'] & data1$Date < data2[nrow(data2), 'StartDate'], 'DateMonthlySeg'] <- data2[i, 'StartDate']
}

Then aggregate to get the sum within each category in data1, and merge with data2 (for regression convenience)

data2 <- merge(data2, aggregate(variable ~ DateMonthlySeg, data = data1, sum), 
           by.x = 'StartDate', by.y = 'DateMonthlySeg')

Finally, perform linear regression

>lm(volume~variable, data = data2)

Call:
lm(formula = volume ~ variable, data = data2)

Coefficients:
(Intercept)     variable  
10.33248635   0.04394532
SeaSprite
  • 564
  • 8
  • 12
  • Thank you SeaSprite. Running those scripts did help - but with what I am still trying to figure out! For the initial creation of data frames - I ran the scripts but the output in "variable" column does not = what I have in that column. What is this value representative of? For the monthly "volume" i see this is a count of the months (1 - 13 in my specific case) – AESH Mar 19 '15 at 18:15
  • I can replicate the creation of your data2, but not data1 - variable produces different numbers on my machine when using the exact same code: Date variable 1 2013-03-01 0.241309647 2 2013-03-02 0.541467690 3 2013-03-03 0.574910429 4 2013-03-04 0.158847045 5 2013-03-05 0.882924404 ... – AESH Mar 19 '15 at 18:22
  • Yes, it should be different numbers. `runif` returns uniform distributed random numbers – SeaSprite Mar 19 '15 at 20:05