0

Let me first give you an idea of how the data looks like:

Customer Value Module SubModule ModuleTF month department newCust
   1       5     M1      SM1       1      1      DEP1       0
   1       3     M1      SM1       1      2      DEP1       0
   1       8     M1      SM1       1      3      DEP1       0
   1       4     M2      SM1       1      1      DEP2       0
   1       5     M2      SM2       1      1      DEP2       0
   1       45    A5     null       0      1      DEP2       0
   2
   ...

What I would like to do is to calculate a slope for VALUE of MONTH where it would be a new column in df. The problem is that It would need to be calculated for every module, sub module, and department. Not calculated if newCust = 0. The thing is also that sometimes values for X month are null and therefore not present in the dataset. I would like these null values to be included as they obviously affect the slope. What is more, Modules sometimes do not have a submodule and calculation should be done in this case as well. Would it be necessary to enter those null values so all Modules and Sub Modules have equal number of entries?

I would like the outcome to look sth like this

Customer Value Module SubModule ModuleTF month department newCust slope
   1       5     M1      SM1       1      1      DEP1       0       1.2
   1       3     M1      SM1       1      2      DEP1       0       1.2
   1       8     M1      SM1       1      3      DEP1       0       1.2
   1       4     M2      SM1       1      1      DEP2       0       1.35
   1       5     M2      SM2       1      1      DEP2       0       1.11
   1       45    A5     null       0      1      DEP2       0       0.23 
   2
   ...

Any help will be more than appreciated!

Thanks!

sampak
  • 155
  • 1
  • 2
  • 12
  • (1) In general, this looks like a subsetting/grouping thing, there are a lot of good Q/As on SO about it. (2) *"I would like these null values to be included as they obviously affect the slope"* ... okay, how do you intend to represent this, dummy variables? (3) If this is a question about programming, it would benefit greatly from knowing what you've already tried and what problems you are having with the actual code. – r2evans Aug 07 '17 at 08:59
  • To add to @r2evans comments: It would be great if you can provide the data in a format that is easy to copy (i.e., follow an [MWE](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)) On a side note: Are you sure that the slope for DEP1 is 1.2 and not 1.5 (this is the value when I regress Value on Month for the first three points). – David Aug 07 '17 at 09:10
  • I have been thinking of creating loops where 1. subset by custid, 2. subset single custid by module 3. subset module of single cust by submodule, and then iterate 3 times for every month and do the calculations and enter it into a new column. Then the solution of a single subset would append a new df. Where the values of specific month don't exist, they should just be 0s. @David, I did not calculate that, just typed whatever, but you're right. The reason I asked is that I am new to R and did not really know where to start from. Had some basic thinking of it so far. – sampak Aug 07 '17 at 10:16

1 Answers1

1

What you can do is use dplyr and purrr to create linear models using the lm-function by different groups.

Taken the necessary data from your example, one could do

library(dplyr) # for the data munging
library(purrr) # for the do-function (modelling)

# create some example data
df <- data_frame(
  customer = rep(1, 6),
  value = c(5, 3, 8, 4, 5, 45),
  month = c(1, 2, 3, 1, 2, 3),
  departement = rep(c("Dep1", "Dep2"), each = 3)
)
# look at the data
df
#> # A tibble: 6 x 4
#>   customer value month departement
#>      <dbl> <dbl> <dbl>       <chr>
#> 1        1     5     1        Dep1
#> 2        1     3     2        Dep1
#> 3        1     8     3        Dep1
#> 4        1     4     1        Dep2
#> 5        1     5     2        Dep2
#> 6        1    45     3        Dep2

# create a linear model per group
df %>%
  group_by(customer, departement) %>% 
  do(mod_lin = lm(value~month, data = .)) %>% 
  mutate(intercept = mod_lin$coefficients[1],
         slope = mod_lin$coefficients[2])
#> Source: local data frame [2 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 2 x 5
#>   customer departement  mod_lin  intercept slope
#>      <dbl>       <chr>   <list>      <dbl> <dbl>
#> 1        1        Dep1 <S3: lm>   2.333333   1.5
#> 2        1        Dep2 <S3: lm> -23.000000  20.5

If you want to learn more about the aspects of the code, just search for dplyr, r piping, and purrr. The two packages have wonderful explanations.

David
  • 9,216
  • 4
  • 45
  • 78
  • Thank you @David, I will have a look at these packages before I will try to make it work. – sampak Aug 07 '17 at 13:42
  • Did that help you? If so, please consider upvoting and or accepting my answer. – David Aug 08 '17 at 06:32
  • When I try to run it, tells me that object mod_lin not found – sampak Aug 08 '17 at 10:17
  • Have you successfully installed and loaded the purrr library? – David Aug 08 '17 at 10:36
  • Does that error come when you run my code in an empty environment? (I.e., use rm(list=ls()) to remove all objects in scope) – David Aug 08 '17 at 10:49
  • Yes, this is the error, little mistake in your code, there is a coma after department where creating a df – sampak Aug 08 '17 at 11:14
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/151378/discussion-between-david-and-sampak). – David Aug 08 '17 at 11:33