2

We have sales data that comes from a touchscreen vending-style machine. When a customer puts money in the machine, it starts a session, counting those sessions using a sequence of numbers unique to that machine. Most of the time, the system starts and stops sessions when it should. However, ~7% of the time, it artificially starts a new session when there is still money left in the machine to be spent.

So,

session available.spend actual.spend
    1         20            20
    2         25            17
    3          0             8
    4         15            15
    5         14             7
    6          0             7
    7         59            50
    8          0             9
    9         15            15
    10        21            21

where available.spend is a sum of all the different columns indicating money or vouchers were inserted into the machine, and actual.spend is a sum of all the money spent during that session.

So, most of the time they equal one another. However, in session 2, $25 was inserted and only $17 was spent. Session 3 shows no money available to be spent, but $8 actually spent, which balances out the first session.

I'd like to have R combine those sessions and create an indicator column telling me the new session is a result of combining sessions.

How would I have R look to see if a session balanced, then, if it does not, check the next session to see if:

  1. there was no available.spend;
  2. there was actual.spend; and,
  3. the actual.spend from both sessions == the available.spend from the first session

Then, if (and only if) all three criteria are met, those two session are combined (using either session number or a new, made-up one), and a new column with a 1 showing that the new session is a result of combining other sessions.

Here's the dput() for my made-up sample:

mydt<-    structure(list(session = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), available.spend = c(20, 
      25, 0, 15, 14, 0, 59, 0, 15, 21), actual.spend = c(20, 17, 8, 
      15, 7, 7, 50, 9, 15, 21)), .Names = c("session", "available.spend", 
      "actual.spend"), row.names = c(NA, -10L), class = c("data.table", 
      "data.frame"), .internal.selfref = <pointer: 0x0000000000300788>)

Here's what I'd like the output to look like:

session available.spend actual.spend    newsess
    1          20           20             0
    2          25           25             1
    4          15           15             0
    5          14           14             1
    7          59           59             1
    9          15           15             0
    10         21           21             0

and the dput():

mynew.dt<- structure(list(session = c(1, 2, 4, 5, 7, 9, 10), available.spend = c(20, 
            25, 15, 14, 59, 15, 21), actual.spend = c(20, 25, 15, 14, 59, 
            15, 21), newsess = c(0, 1, 0, 1, 1, 0, 0)), .Names = c("session", 
            "available.spend", "actual.spend", "newsess"), row.names = c(NA, 
            -7L), class = c("data.table", "data.frame"), .internal.selfref = <pointer:                     0x0000000000300788>)

I've been trying to find ways to this in data.table (the dataset is very large) and/or with ifelse, but I can't figure out how to check three conditions and only perform an action if it meets all three, while also deleting the old columns and creating a dummy variable column. Whew

One more wrinkle: these session IDs can (though it happens rarely) occur on more than one day. So, the code would have to either look for the very next line in the data.frame or, if it looked for the session that comes next sequentially, it would need to make sure the dates on the two sessions matched.

Thanks for any help.

datahappy
  • 826
  • 2
  • 11
  • 29
  • This actually can get a bit more complicated in that _sometimes_, the session might not balance in the next session, but over the next two (or three). This is rare enough, though, that I'm hoping I can ignore those. ~7% of sessions being artificially created, however, is too much to ignore. – datahappy Apr 04 '14 at 18:22

3 Answers3

1

So not quite the same result data.frame you want. I am using cumsum (Cumulative Sum) on available spend and actual spend. Then I check which ones are match up, and only for those that matchup I put "1" in the new.session column.

mydt$spend.sum <-cumsum(mydt$actual.spend) #Cumulative sum of actual 
mydt$avail.sum <-cumsum(mydt$available.spend) #Cumulative sum of actual 

now make a new column and make it all NA's

mydt$new.session <-NA

Check which cumulative sums match up and replace NA's with 1's

mydt$new.session[with(mydt, which(spend.sum == avail.sum))]<-1

If you only want data.frame with the 1's in the new.session column

do this

mydt[complete.cases(my.dt),]
infominer
  • 1,981
  • 13
  • 17
  • That is certainly a viable workaround. My concern is that with 7% of session being knocked out, the results will be skewed. The data.table has ~24 million rows, so 7% is a sizeable number of customers being removed from the analysis. If I can't figure out the above solution, this will be as good as I can hope for. Thanks for taking the time to figure it out. – datahappy Apr 04 '14 at 19:11
  • On second look, cumsum isn't the way to go. I can get sessions where available and actual meet by just doing the last three lines of code. Cumsum make it seem as if that only occurs a few times out of millions (the first four). – datahappy Apr 04 '14 at 19:25
  • @infominer fyi and to show the difference between readability of `data.frame` and `data.table`, this is the `data.table` way: `mydt[, new.session := cumsum(available.spend) == cumsum(actual.spend)]`; or, to avoid overflows: `mydt[, new.session := cumsum(available.spend - actual.spend) == 0]` – eddi Apr 04 '14 at 19:57
  • @datahappy I don't understand why you think `cumsum` doesn't do what you want - you may want to give an example where it fails – eddi Apr 04 '14 at 19:59
  • @eddi because cumsum does just that- provides an overall cumulative sum. As soon as there was a single instance of available and actual not being equal, then they were unequal for each and every instance after that. (Which occurred at the fourth observation). Thus, when a data.table of only complete cases was created it only contained the first four observations, rather than the 21 million that it should have contained. – datahappy Apr 04 '14 at 20:06
  • There was already a column showing available spend and actual spend- by just setting the new.session column equal to 1 when those two columns equalled (sans cumsum), the completed.cases DT was correct. – datahappy Apr 04 '14 at 20:08
1

If your available.spend is always zero in these cases, you can use that to group the rows (I'm assuming you sometimes have more than one of these 0's in a row, otherwise you could trivially just take actual.spend, shift it by 1 and sum back to compare):

dt[, list(session = session[1],
          available.spend = sum(available.spend),
          actual.spend = sum(actual.spend)),
     by = cumsum(available.spend != 0)]
#   cumsum session available.spend actual.spend
#1:      1       1              20           20
#2:      2       2              25           25
#3:      3       4              15           15
#4:      4       5              14           14
#5:      5       7              59           59
#6:      6       9              15           15
#7:      7      10              21           21

From this point on you should have all the info you need to proceed.

Perhaps, more generally, it would be better to group by cumsum(available.spend >= actual.spend).

eddi
  • 49,088
  • 6
  • 104
  • 155
  • Ok, so this worked like a charm- kinda. It combined the sessions correctly, fixing all the artificially split sessions. Unfortunately, the data.table it created only contained the variables session, available.spend, and actual.spend (plus a cumsum column). How do I get it to keep and combine all the other (22) columns in the data.table? – datahappy Apr 08 '14 at 12:56
  • It looks like I'll need to individually add all the variables to the list? In the same way, I assume (e.g. totsales = sum(totsales), etc) Is there a simpler way to do this? – datahappy Apr 08 '14 at 13:02
  • 1
    @datahappy use `.SD[1]`, instead of `session = session[1]` and if that's too slow, see [this post](http://stackoverflow.com/q/16573995/817778); or perhaps I'm misunderstanding and you want to use `lapply(.SD, sum)` instead of the individual sums; take your pick :) – eddi Apr 08 '14 at 15:12
0

This is kind of a clunky solution, but given the narrow parameters and desired outcome, I can't think of a better way to do this except piece-by-piece.

mismatches <- mydt[available.spend != actual.spend, which=TRUE]
zeros <- mydt[available.spend == 0, which=TRUE]
x <- setdiff(mismatches, zeros)
followcheck <- mydt[x+1, session == mydt[zeros, session] & actual.spend > 0]
following.zeros <- zeros[followcheck]
sumthing <- mydt[x, available.spend==actual.spend + mydt[following.zeros, actual.spend]]
x <- x[sumthing]
y <- x + 1
mydt[x, actual.spend:=actual.spend + mydt[y, actual.spend]]
# Caution here, data.table gave a warning about needing to copy the table in memory to do this next line.
mydt[, newsess:=0]
mydt[x, newsess:=1]
mydt <- mydt[-y,]
doicomehereoften1
  • 537
  • 1
  • 4
  • 12