0

I have the following dataframe.

    SEC VORDEN_PREVENT1  VORDEN_PREVENT2  VORDEN_PREVENT3  VORDEN_PREVENT4 VORDEN_PREVENT5
2484628            1500             1328             2761             3003            2803
2491884            1500             1500             1169             2813            1328
2521158            1500             2813             1328             2761            3003
2548370            1500             1257             2595             1187            1837
2580994            1500             5057             2624             2940            2731
2670164            1500             1874             1218             2791            2892

In this dataframe I have as VORDEN_PREVENT* the number of cars sold every day, for example VORDEN_PREVENT1 means that I sold this day 1500 cars, what I want is to return the columns from the rows that produces a purchase of for example 3000 cars.

For that example, should be 1500 from VORDEN_PREVENT1, 1328 from VORDEN_PREVENT2 and 172 from VORDEN_PREVENT3, which is the difference from 2761 and the sum from VORDEN_PREVENT1 and VORDEN_PREVENT2.

I don't know how to obtain this row and column data and to get the difference properly, to obtain my data correctly.

Uwe
  • 41,420
  • 11
  • 90
  • 134
fiticida
  • 664
  • 1
  • 10
  • 24

2 Answers2

1

If I understand correctly, the VORDEN_PREVENT* columns denote sales on subsequent days. The OP asks on which day the cumulative sum of sales exceeds a given threshold. In addition the OP wants to see the sales figures which sum up to threshold.

I suggest to solve this type of questions in long format where columns can be treated as data.

1. melt() / dcast()

library(data.table)
threshold <- 3000L
long <- melt(setDT(DT), id.var = "SEC")
long[, value := c(value[1L], diff(pmin(cumsum(value), threshold))), by = SEC]
dcast(long[value > 0], SEC ~ variable)
       SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3
1: 2484628            1500            1328             172
2: 2491884            1500            1500              NA
3: 2521158            1500            1500              NA
4: 2548370            1500            1257             243
5: 2580994            1500            1500              NA
6: 2670164            1500            1500              NA

2. gather() / spread()

library(tidyr)
library(dplyr)
threshold <- 3000L
DT %>% 
  gather(, , -SEC) %>% 
  group_by(SEC) %>% 
  mutate(value = c(value[1L], diff(pmin(cumsum(value), threshold)))) %>% 
  filter(value >0) %>% 
  spread(key, value)
# A tibble: 6 x 4
# Groups:   SEC [6]
      SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3
    <int>           <int>           <int>           <int>
1 2484628            1500            1328             172
2 2491884            1500            1500              NA
3 2521158            1500            1500              NA
4 2548370            1500            1257             243
5 2580994            1500            1500              NA
6 2670164            1500            1500              NA

3. apply()

With base R:

DT[, -1] <- t(apply(DT[, -1], 1, function(x) c(x[1L], diff(pmin(cumsum(x), threshold)))))
DT
      SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3 VORDEN_PREVENT4 VORDEN_PREVENT5
1 2484628            1500            1328             172               0               0
2 2491884            1500            1500               0               0               0
3 2521158            1500            1500               0               0               0
4 2548370            1500            1257             243               0               0
5 2580994            1500            1500               0               0               0
6 2670164            1500            1500               0               0               0

Data

library(data.table)
DT <- fread("
    SEC VORDEN_PREVENT1  VORDEN_PREVENT2  VORDEN_PREVENT3  VORDEN_PREVENT4 VORDEN_PREVENT5
2484628            1500             1328             2761             3003            2803
2491884            1500             1500             1169             2813            1328
2521158            1500             2813             1328             2761            3003
2548370            1500             1257             2595             1187            1837
2580994            1500             5057             2624             2940            2731
2670164            1500             1874             1218             2791            2892",
data.table = FALSE)
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
0

Your question is not very clear to me so I reduce it to what I understand (you want to create a column, then filter on rows). Using dplyr this can be done quite easily but we first recreate some data.

# recreate some data
df <- data.frame(time=1:3,
                 sales1=c(1234, 1567, 2045),
                 sales2=c(865, 756, 890))

# first create a diff column
df <- df %>% mutate(sales_diff=sales1-sales2)

df
time sales1 sales2 sales_diff
   1   1234    865        369
   2   1567    756        811
   3   2045    890       1155

# then you can access the rows you're interested in by filtering them
df %>% filter(sales1==1567)

time sales1 sales2 sales_diff
   2   1567    756        811

You can just replace the object/column names with your own data. Is that what you were looking for?

Vincent Bonhomme
  • 7,235
  • 2
  • 27
  • 38
  • What I want to obtain is the following. I want to know in how many days I have sold 2000 cars. In this case for the first example you are using, I want to get 1234 from first sales1 and the 766 from sales2 which is the result from the difference between sales1 and 2000, and as in sales2 I have more cars volumen that what I am asking I return sales1 1234(all volume) and sales2 (766) . I sold 1234 first day and 766 from the second day. – fiticida Aug 28 '18 at 12:59