0

I am trying to find the first/last observation by group. I tired both R and excel (because it is so slow in R so I tried excel). The excel took less than one second, but R took 8 MINUTES!!!. The code logic for both are almost the same.

The data is a panel data regarding purchasing fruit. The same shopper could buy multiple times at different time. I have 233,000 observations. Data is like(sorted by day shopper first and day):

    Day Shopper Choice
    1   A   apple
    2   A   apple
    1   B   Banana
    1   C   apple
    2   C   Banana
    3   C   apple
    1   D   berry
    2   D   berry

My r code for the first obseravtion. I want to tag the first observation in a group by indicator "1", a new column.

 for (i in 1:n)
 { ifelse (dt$shopper[i+1]==dt$shopper[i],newcol[i+1]<-0,newcol[i+1]<-1)
  }

My excel code is: if(B2<>B1,1,0)

I need the repurchase matrix given the same shopper. To define the repurchase : the repurchase of the "1st purchase" is the "2nd purchase"; And the repurchase of the "2nd purchase" is the "3rd purchase" The last purchase has no repurchase. Sorry it sounds like a twister. so my solution is get two choice col and move the second col one row above so I can calculate the repurchase matrix by shopper/or aggregate.The desired output for tagging first obs by group should be as follow. With the columns of choice and choice 2 and I can calculate the repurchase matrix by nrow.

Day Shopper Choice  tagging choice 2
 1  A   apple      0       *apple*
 2  A   apple      *apple*  0
 1  B   Banana     0        0
 1  C   apple      0        Banana
 2  C   Banana     Banana   apple 
 3  C   apple      apple    0
 1  D   berry      0        *berry*
 2  D   berry      *berry*  0

[update]. If there is only one purchase for this user, there is no repurchase. If the purchase is the last purchase of the user, there is no repurchase. So the final repurchase matrix by choice in this case is

        second  inside bracket are the probability  
first   apple banana berry   
apple   1 (0.5)  1      0
banana  1        0 (0)  0
berry   0        0      1 (1)

Below is how I calculate the repurchase matrix after tagging the first purchase of the user. i is the row (apple, banana, Berry etc) and j (apple, banana, Berry etc) is the column. [the speed is ok give the fact that I repurchase matrix is 40*40 after tagging and adding the seconding choice column)

    for (i in 1:n){
    for(j in 1:n){
    repurchase_matrix[i,j]=nrow(dt[dt[,1]==i&dt[,2]==j,])}}
MLE
  • 1,033
  • 1
  • 11
  • 30
  • It's not R that's slow. It's your code. `ifelse()` is vectorized. Plus I don't think you can even use `<-` assignment in the `ifelse()` yes/no arguments. But even further, once you get rid of the `for()` loop, `ifelse()` is know to be rather slow as well. If this is a grouped operation on large data I would recommend the *data.table* package for efficiency. – Rich Scriven Jul 09 '16 at 05:03
  • `1-(B2=B1)` should be faster than the if... still richard is right: the way your code is executed takes so much time... – Dirk Reichel Jul 09 '16 at 05:06
  • @RichardScriven. Thank you! I am new to R.:( I also tried the (if, else), still very slow. Any suggestions to make it faster? – MLE Jul 09 '16 at 05:10
  • You just want to make the first row indicator for each group 1, and 0 otherwise? Please provide the desired output based on your example data. – Rich Scriven Jul 09 '16 at 05:12
  • @DirkReichel, thank you.I am very happy with the excel speed but thank you anyway. I will try that. Even if I used if in excel, it takes less than a second to finish tagging the first/last observations. How can I make my R code faster in this case? – MLE Jul 09 '16 at 05:12
  • Could you please add the desired result to your question so we can see the finished product. – Rich Scriven Jul 09 '16 at 05:23

4 Answers4

4

First, assuming the data are sorted by Shopper and then by Day in ascending order, you can add a column indicating the purchase number with

df$Purchase <- unlist(with(df, tapply(Shopper, Shopper, seq_along)))
df
#  Day Shopper Choice Purchase
#1   1       A  apple        1
#2   2       A  apple        2
#3   1       B Banana        1
#4   1       C  apple        1
#5   2       C Banana        2
#6   3       C  apple        3
#7   1       D  berry        1
#8   2       D  berry        2

Then reshape the data-frame to "wide" format with

df.w <- reshape(df[c('Shopper', 'Choice', 'Purchase')],
                idvar='Shopper', v.names='Choice', timevar='Purchase',
                direction='wide')
df.w
#  Shopper Choice.1 Choice.2 Choice.3
#1       A    apple    apple     <NA>
#3       B   Banana     <NA>     <NA>
#4       C    apple   Banana    apple
#7       D    berry    berry     <NA>

Finally you calculate the repurchase matrix of the first two purchases

with(df.w, prop.table(table(First=Choice.1, Second=Choice.2)))
#        Second
#First        apple    Banana     berry
#  apple  0.3333333 0.3333333 0.0000000
#  Banana 0.0000000 0.0000000 0.0000000
#  berry  0.0000000 0.0000000 0.3333333

To calculate the repurchase matrix of all purchases, start with the repurchase matrices of every two consecutive purchases

repurchase <- lapply(seq(2, ncol(df.w) - 1),
                     function(i) table(First=df.w[[i]], Second=df.w[[i + 1]]))
repurchase <- simplify2array(repurchase)
repurchase
#, , 1
#
#        Second
#First    apple Banana berry
#  apple      1      1     0
#  Banana     0      0     0
#  berry      0      0     1
#
#, , 2
#
#        Second
#First    apple Banana berry
#  apple      0      0     0
#  Banana     1      0     0
#  berry      0      0     0

then add all matrices to get the "total" repurchase matrix

apply(repurchase, 1:2, sum)
#        Second
#First    apple Banana berry
#  apple      1      1     0
#  Banana     1      0     0
#  berry      0      0     1

(absolute frequencies)

prop.table(apply(repurchase, 1:2, sum))
#        Second
#First    apple Banana berry
#  apple   0.25   0.25  0.00
#  Banana  0.25   0.00  0.00
#  berry   0.00   0.00  0.25

(relative frequencies)

Ernest A
  • 7,526
  • 8
  • 34
  • 40
  • The repurchase I mean this: the repurchase of the "1st purchase" is the 2nd purchase; And the repurchase of the "2nd purchase" is the "3rd purchase" The last purchase has no repurchase. Sorry it sounds like a twister. – MLE Jul 10 '16 at 01:45
  • @Phdaml I updated the answer (not sure about how you calculate the probabilities). – Ernest A Jul 10 '16 at 09:28
  • @Phdaml if `Choice` is a factor, then all repurchase sub-matrices should have the same dimensions. I think the problem may be that in your data-frame `Choice` is a character vector, rather than a factor. – Ernest A Jul 11 '16 at 08:25
  • You are right!, You code produce the same out put as my original code and so much faster. If you have time, could you explain it a bit why by changing it to factor, work? – MLE Jul 11 '16 at 17:05
  • @Phdaml it's because of the way `table()` works. when the argument is a factor, `table()` counts the occurrences for every level (including unused levels, which will have a count of zero) so the returned array always have the same dimensions. whereas if the argument is a character vector, it never reports zero counts (because it has no way of knowing all the possible values that the variable can take) so the returned array can vary in size at every iteration. – Ernest A Jul 11 '16 at 19:26
2

In R, we can use dplyr. After grouping by 'Shopper', create the 'Flag' column for first observation by using the logical condition row_number() < 2 and convert the logical to integer if required.

library(dplyr)
df1 %>% 
   group_by(Shopper) %>%
   mutate(Flag = as.integer(row_number() < 2))

If we can use the minimum and maximum 'Day' as the identifier, then use the logical condition based on that.

df1 %>% 
     group_by(Shopper) %>%
     mutate(Flag = as.integer(Day %in% range(Day)))

Or using data.table

library(data.table)
setDT(df1)[, Flag := as.integer(Day %in% range(Day)), by = Shopper]

Or using base R, we can compare the previous 'Shopper' with the current 'Shopper' (assuming that the dataset is already ordered)

i1 <- with(df1, Shopper[-1]!= Shopper[-nrow(df1)])
as.integer(c(TRUE, i1)|c(i1, TRUE))
#[1] 1 1 1 1 0 1 1 1

All these methods should be faster than the for loop in the OP's code.

Update

Based on the updated expected output, if we need to replace the 1st observation with "0" while the others remain same, either an ifelse or replace can be used and using the lead of 'tagging', we create the 'tagChoice2'.

df1 %>%
   group_by(Shopper) %>% 
   mutate(tagging = ifelse(row_number()==1, "0", as.character(Choice)), 
          tagChoice2 = lead(tagging, default = "0"))   
#   Day Shopper Choice tagging tagChoice2
#  <int>   <chr>  <chr>   <chr>      <chr>
#1     1       A  apple       0      apple
#2     2       A  apple   apple          0
#3     1       B Banana       0          0
#4     1       C  apple       0     Banana
#5     2       C Banana  Banana      apple
#6     3       C  apple   apple          0
#7     1       D  berry       0      berry
#8     2       D  berry   berry          0
Community
  • 1
  • 1
akrun
  • 874,273
  • 37
  • 540
  • 662
1

I was looking for answer to finding first and last value of a column by grouping in data.table. After looking here and there, and thinking about it, here you go.

To create order of rows by group:

library(data.table)

DT <- data.table(col1 = rep(LETTERS[1:2], each = 4), col2 = c(3,12,5,56,6,678,233,70))
setorder(DT, col1, col2)
DT
   col1 col2
1:    A    3
2:    A    5
3:    A   12
4:    A   56
5:    B    6
6:    B   70
7:    B  233
8:    B  678

DT[, rank := order(col2), by = col1]
DT
   col1 col2 rank
1:    A    3    1
2:    A    5    2
3:    A   12    3
4:    A   56    4
5:    B    6    1
6:    B   70    2
7:    B  233    3
8:    B  678    4

To create first and last values by group:

DT[, first_val := col2[1], by = col2]
DT[, last_val := col2[.N], by = col1]
DT
   col1 col2 rank first_val last_val
1:    A    3    1         3       56
2:    A    5    2         3       56
3:    A   12    3         3       56
4:    A   56    4         3       56
5:    B    6    1         6      678
6:    B   70    2         6      678
7:    B  233    3         6      678
8:    B  678    4         6      678
Masood Sadat
  • 1,247
  • 11
  • 18
0

You can try install the Microsoft R open as your default R. In terms of math calculation, it is way faster than R base. Because it employs more cores while the R.BASE only uses one core to compute.