R: Sum values based on 2 condition from 2 data frame of different length

Question

I'm trying to get the total & average of some variables based on date. I have fuel data & driving data of several vehicles. Fuel data consist of several dates whereas driving data consist of more dates (Logically you'll only refuel after x trips). My end result would be to get the sum/average driving data based on fuel dates.

Fuel data:

plate = c("AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345") 
date = c("2017-09-08", "2017-09-11", "2017-09-13", "2017-09-20", "2017-09-06", "2017-09-08", "2017-09-15", "2017-09-23", "2017-09-10", "2017-09-18")
liter = c(33, 15, 28, 40, 43, 20, 25, 50, 26, 48) 
df1 = data.frame(plate, date, liter)

Driving data:

plate = c("AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345") 
date = c("2017-09-01", "2017-01-05", "2017-09-08", "2017-09-10", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-16", "2017-09-17", "2017-09-20", "2017-09-22", "2017-09-25", "2017-09-02", "2017-09-03", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09", "2017-09-13", "2017-09-15", "2017-09-17", "2017-09-20", "2017-09-23", "2017-09-25", "2017-09-01", "2017-09-04", "2017-09-09", "2017-09-12", "2017-09-15", "2017-09-18", "2017-09-19", "2017-09-20", "2017-09-23", "2017-09-27", "2017-09-30")
mileage = c(50, 64, 45, 70, 58, 41, 22, 15, 90, 48, 52, 48, 29, 65, 70, 46, 88, 71, 40, 51, 38, 91, 74, 61, 41, 33, 59, 81, 72, 65, 43, 81, 20, 49, 39)
accx = c(0, 3, 4, 0, 8, 11, 2, 5, 9, 10, 2, 22, 9, 6, 7, 6, 8, 1, 0, 1, 8, 1, 7, 6, 4, 3, 9, 11, 22, 15, 13, 1, 2, 4, 9)
df2 = data.frame(plate, date, mileage,accx)

Merging both data

df.all = left_join(df2, df1, by.x =c("plate", "date"))

I'd like to get the total mileage(sum) and average accx based on fuel dates. The end result should look like this:

Expected data frame

Is there a way to use dplyr to mutate the desired results? FYI I only need the mutated parameters with their plates (result line 1,2,4,6,8,etc isn't needed) Thanks in advance!

Please do not post *images* of data, just post the data itself: images require somebody helping you to transcribe data themselves (something I'm generally unwilling to do). Some references: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example and https://stackoverflow.com/help/mcve. — r2evans, Jan 17 '18 at 08:18
Sorry. It's just that I don't know the codes to generate the results. So I had to post an image of how I want it to look like — Zi-Xin, Jan 17 '18 at 08:22
I've not seen `left_join(..., by.x=...)`. The `by.x` resembles an argument to `merge`, but `dplyr::left_join` only accepts `by`. As for your expected data ... just create something manually and put it out same as you did your fuel and driving data frames. (The image resembles RStudio, so how did you get a representative data.frame without being able to paste generating code, ala `dput(head(x))`?) — r2evans, Jan 17 '18 at 08:42
i edited the image haha. i actually have over 10000 data per month for 12 months. manually calculating it (like i posted in the image) is gonna take me weeks. — Zi-Xin, Jan 17 '18 at 09:00
Can you use filtered or even made-up data? Asking questions on SO can be an art, and knowing how to simplify data and code to the minimal required is both useful to you (for understanding the process) and increases the likelihood that somebody will help you quickly. — r2evans, Jan 17 '18 at 09:03
Gotcha. Thanks for the tip! I'll include the made-up data in the future. — Zi-Xin, Jan 17 '18 at 09:11

score 2 · Accepted Answer · answered Jan 17 '18 at 08:45

There might be a more elegant way, but this works:

library(dplyr)

df.all %>% 
  mutate(date = as.Date(date)) %>% 
  group_by(plate) %>% 
  arrange(plate, date) %>% 
  mutate(t.mileage = cumsum(mileage) * !is.na(liter), 
         t.accx = cumsum(accx) * !is.na(liter), 
         n = seq_len(n())) %>% 
  filter(!is.na(liter)) %>% 
  mutate(t.mileage.lag = lag(t.mileage), 
         t.accx.lag = lag(t.accx),  
         n.lag = lag(n),
         t.mileage = ifelse(!is.na(t.mileage.lag), t.mileage - t.mileage.lag, t.mileage), 
         a.accx = ifelse(!is.na(t.accx.lag), (t.accx - t.accx.lag)/(n - n.lag), t.accx/n)) %>% 
  select(-t.mileage.lag, -t.accx.lag, -n.lag, -n, -t.accx)

# A tibble: 9 x 7
# Groups:   plate [3]
#   plate       date mileage  accx liter t.mileage     a.accx
#   <chr>     <date>   <dbl> <dbl> <dbl>     <dbl>      <dbl>
# 1 AB123 2017-09-08      45     4    33       159  2.3333333
# 2 AB123 2017-09-11      58     8    15       128  4.0000000
# 3 AB123 2017-09-13      22     2    28        63  6.5000000
# 4 AB123 2017-09-20      48    10    40       153  8.0000000
# 5 AC234 2017-09-06      70     7    43       164  7.3333333
# 6 AC234 2017-09-08      88     8    20       134  7.0000000
# 7 AC234 2017-09-15      51     1    25       162  0.6666667
# 8 AC234 2017-09-23      74     7    50       203  5.3333333
# 9 AD345 2017-09-18      65    15    48       351 10.6666667

Group by plate to only calculate the following per different plate. Then calculate the totale mileage and totale accx with cumsum, but only keep the values where we have non-missing liter. Also count how many drives we have with n. Then as we're only interested in the information where we fuel the car we filter by non-missing liter. Using lag subtract the previous total milage and accx from each value (unless there is no previous value, i.e. the lag is NA) and then calculate average accx.

Data

df1 <- 
  data.frame(plate = c("AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345"), 
             date = c("2017-09-08", "2017-09-11", "2017-09-13", "2017-09-20", "2017-09-06", "2017-09-08", 
                      "2017-09-15", "2017-09-23", "2017-09-10", "2017-09-18"), 
             liter = c(33, 15, 28, 40, 43, 20, 25, 50, 26, 48), 
             stringsAsFactors = F)

df2 <-
  data.frame(plate = c("AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AB123", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AC234", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345", "AD345"), 
             date = c("2017-09-01", "2017-01-05", "2017-09-08", "2017-09-10", "2017-09-11", "2017-09-12", "2017-09-13", "2017-09-16", "2017-09-17", "2017-09-20", "2017-09-22", "2017-09-25", "2017-09-02", "2017-09-03", "2017-09-06", "2017-09-07", "2017-09-08", "2017-09-09", "2017-09-13", "2017-09-15", "2017-09-17", "2017-09-20", "2017-09-23", "2017-09-25", "2017-09-01", "2017-09-04", "2017-09-09", "2017-09-12", "2017-09-15", "2017-09-18", "2017-09-19", "2017-09-20", "2017-09-23", "2017-09-27", "2017-09-30"),
             mileage = c(50, 64, 45, 70, 58, 41, 22, 15, 90, 48, 52, 48, 29, 65, 70, 46, 88, 71, 40, 51, 38, 91, 74, 61, 41, 33, 59, 81, 72, 65, 43, 81, 20, 49, 39), 
             accx = c(0, 3, 4, 0, 8, 11, 2, 5, 9, 10, 2, 22, 9, 6, 7, 6, 8, 1, 0, 1, 8, 1, 7, 6, 4, 3, 9, 11, 22, 15, 13, 1, 2, 4, 9), 
             stringsAsFactors = F)

df.all <- left_join(df2, df1, by = c("plate", "date"))

Thanks a lot! I didn't think of multiplying with !is.na(liter). And thanks for introducing the lag function. Saved me tons of time! — Zi-Xin, Jan 17 '18 at 09:08

score 1 · Answer 2 · answered Jan 17 '18 at 08:55

Although @kath provided a more convenient solution, here's a base R one (if only because I spend some time working on it):

# generate factor to split on
temp <- which(!is.na(df.all$liter))
vec <- temp - c(0, temp[-length(temp)])
df.all$split <- rep(seq(1, length(temp)+1), c(vec, nrow(df.all)-temp[length(temp)]))

# split df.all and calculate t.mileage and a.accx for each subsample
df.temp <- split(df.all, df.all$split)
t.mileage <- sapply(df.temp, function(x) sum(x[, "mileage"]))
a.accx <- sapply(df.temp, function(x) mean(x[, "accx"]))

# generate new variables and insert calculated values
df.all$t.mileage <- NA
df.all$t.mileage[temp] <- t.mileage[-length(t.mileage)]
df.all$a.accx <- NA
df.all$a.accx[temp] <- a.accx[-length(a.accx)]

# display df.all without splitting factor
df.all <- subset(df.all, select = -split)

> df.all
   plate       date mileage accx liter t.mileage     a.accx
1  AB123 2017-09-01      50    0    NA        NA         NA
2  AB123 2017-01-05      64    3    NA        NA         NA
3  AB123 2017-09-08      45    4    33       159  2.3333333
4  AB123 2017-09-10      70    0    NA        NA         NA
5  AB123 2017-09-11      58    8    15       128  4.0000000
6  AB123 2017-09-12      41   11    NA        NA         NA
7  AB123 2017-09-13      22    2    28        63  6.5000000
8  AB123 2017-09-16      15    5    NA        NA         NA
9  AB123 2017-09-17      90    9    NA        NA         NA
10 AB123 2017-09-20      48   10    40       153  8.0000000
11 AB123 2017-09-22      52    2    NA        NA         NA
12 AB123 2017-09-25      48   22    NA        NA         NA
13 AC234 2017-09-02      29    9    NA        NA         NA
14 AC234 2017-09-03      65    6    NA        NA         NA
15 AC234 2017-09-06      70    7    43       264  9.2000000
16 AC234 2017-09-07      46    6    NA        NA         NA
17 AC234 2017-09-08      88    8    20       134  7.0000000
18 AC234 2017-09-09      71    1    NA        NA         NA
19 AC234 2017-09-13      40    0    NA        NA         NA
20 AC234 2017-09-15      51    1    25       162  0.6666667
21 AC234 2017-09-17      38    8    NA        NA         NA
22 AC234 2017-09-20      91    1    NA        NA         NA
23 AC234 2017-09-23      74    7    50       203  5.3333333
24 AC234 2017-09-25      61    6    NA        NA         NA
25 AD345 2017-09-01      41    4    NA        NA         NA
26 AD345 2017-09-04      33    3    NA        NA         NA
27 AD345 2017-09-09      59    9    NA        NA         NA
28 AD345 2017-09-12      81   11    NA        NA         NA
29 AD345 2017-09-15      72   22    NA        NA         NA
30 AD345 2017-09-18      65   15    48       412 10.0000000
31 AD345 2017-09-19      43   13    NA        NA         NA
32 AD345 2017-09-20      81    1    NA        NA         NA
33 AD345 2017-09-23      20    2    NA        NA         NA
34 AD345 2017-09-27      49    4    NA        NA         NA
35 AD345 2017-09-30      39    9    NA        NA         NA

By the way, there has to be a simpler way to generate the factor of step 1 above, does anyone know how?

Thank you! Good to know how base function works for this case :) — Zi-Xin, Jan 17 '18 at 09:10

R: Sum values based on 2 condition from 2 data frame of different length

2 Answers2

Data