1

I have a data set which includes seller ID, and number of the total sold TV in each year. An example of this data frame is shown below:

ID <- c(1006332,1010660,1010852,1012960,1012960,1012960,1012960,1012960,1013515,1013515,1013515,1013515,101351)
Sold_year <- c(2017,2016,2011,2011,2012,2013,2015,2016,2014,2015,2016,2018,2019)
n <- c(1,1,1,12,1,3,4,1,2,1,3,1,2)
data <- data.frame(ID,Sold_year,n)

        ID Sold_year  n
1  1006332      2017  1
2  1010660      2016  1
3  1010852      2011  1
4  1012960      2011 12
5  1012960      2012  1
6  1012960      2013  3
7  1012960      2015  4
8  1012960      2016  1
9  1013515      2014  2
10 1013515      2015  1
11 1013515      2016  3
12 1013515      2018  1
13 1013515      2019  2

what I want to do: I want to calculate the average gradient of sold TV for each seller. So for example: for the seller 1013515, I want to do the following calculation:

grad1 = (1-2)/(2015-2014)=-1
grad2 = (3-1)/(2016-2015)= 2
grad3 = (1-3)/(2018-2016)=-1
grad4 = (2-1)/(2019-2018)= 1

average= -1+2-1+1/(4)=0.25

Also, for a seller who has sold a TV only in one year, I consider the gradient as 0 and will add a penalty which is the difference between the sold year and 2020. So for seller 1006332 this average gradient would be

0-(2020-2017)=-3

What I did so far: in order to do the following calculation, I used a for loop to go through a list of specific agent each time and calculate the gradient. This approach works and give the right result, however, it's slow for large datasets. I was thinking maybe there is a way to do it simply with dplyr and data frame calculations. I was thinking to use lag function and here is what I proposed:

test2 <- data %>% 
  group_by(ID) %>% 
  mutate(grad=(n-lag(n))/(Sold_year-lag(Sold_year)))

the result of this code is:

 1                    1006332      2017     1  NA  
 2                    1010660      2016     1  NA  
 3                    1010852      2011     1  NA  
 4                    1012960      2011    12  NA  
 5                    1012960      2012     1 -11  
 6                    1012960      2013     3   2  
 7                    1012960      2015     4   0.5
 8                    1012960      2016     1  -3  
 9                    1013515      2014     2  NA  
10                    1013515      2015     1  -1  
11                    1013515      2016     3   2  
12                    1013515      2018     1  -1  
13                    1013515      2019     2   1  

As you see, it successfully calculates the gradient for the seller with more than one working year; however, I am not sure how I should calculate the final average gradient and also how I should add a penalty to those sellers with one working year(seller : 1006332, 1010660, etc)

the expected output should be in this form:

                                      ID                       Ave_grad
1                                        1006332                      -3.00
2                                        1010660                      -4.00
3                                        1010852                      -9.00
4                                        1012960                      -2.87
5                                        1013515                       0.25

any suggestions or comments?

Thanks

Ross_you
  • 881
  • 5
  • 22
  • @akrun, I added the result, so basically, ID= 1006332 , 1010660, 1010852 will get a penalty as we only have one working year. but for two other IDs, we simply take the average of calculated gradients. – Ross_you Oct 03 '20 at 20:39
  • With that difference divisioin code you showed, not able to replicate the expected output you showed – akrun Oct 03 '20 at 20:50
  • @akrun. Exactly, mu solution can't create the result. That's why I asked for comments or any other solutions that can produce the result. I just shared my solution as a starting point to calculate gradients.. – Ross_you Oct 03 '20 at 20:54
  • @akrun, I would be happy if you can provide any other solution that doesn't use for loop – Ross_you Oct 03 '20 at 20:55
  • I tried your manual calculation to come up with `data %>% group_by(ID) %>% summarise(grad = if(n() == 1) -3 else mean(diff(n)/diff(Sold_year)))`, but it is not giving the expected answer showed. Is the values correct – akrun Oct 03 '20 at 20:56
  • 1
    I think the ID last you showed is `101351` missing the `5` at the end – akrun Oct 03 '20 at 20:58
  • Please check the solution below. It now works. I didn't earlier check for the typos in your initial data. – akrun Oct 03 '20 at 21:03
  • @akrun, thanks, I didn't know we can use if/else in dplyr. Just a quick question, can you please elaborate more why you used "groups = 'drop'" at the end? what does this do? – Ross_you Oct 03 '20 at 21:58
  • 1
    With the new version, if you don't use that, it will drop the last group with `drop_last` (when there are multiple grouping variables) and it will throw a friendly warning message. I just wanted to remove that warning and to make sure that we are dropping all the group attributes. You could change that in `.groups. You may check [this](https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override/62140681#62140681) for more info – akrun Oct 03 '20 at 22:01

1 Answers1

1

We can do an if/else condition after grouping by 'ID', i.e. if the number of rows (n()) is 1, then subtract the 'Sold_year' with the current year (year(Sys.Date()) or else divide the difference of 'n' with difference of 'Sold_year' and take the mean

library(dplyr)
library(lubridate)
data %>%
    group_by(ID) %>%
    summarise(grad = if(n() == 1) (Sold_year - year(Sys.Date())) else  
         mean(diff(n)/diff(Sold_year)), .groups = 'drop')

-output

# A tibble: 5 x 2
#       ID  grad
#    <dbl> <dbl>
#1 1006332 -3   
#2 1010660 -4   
#3 1010852 -9   
#4 1012960 -2.88
#5 1013515  0.25
akrun
  • 874,273
  • 37
  • 540
  • 662