0

I have a problem in data wrangling in R. So I have a data frame like this:

        CardID       Date Amount ItemNumber    ItemCode
1  C0100000111 2001-07-19 449.00          1 I0000000808
2  C0100000111 2001-02-20   9.99          1 I0000000622
3  C0100000111 2001-04-27  49.99          1 I0000000284
4  C0100000111 2001-02-20  69.00          1 I0000000488
5  C0100000111 2001-05-17 299.00          1 I0000000595
6  C0100000111 2001-05-19   5.99          1 I0000000078
7  C0100000199 2001-08-20 229.00          1 I0000000783
8  C0100000199 2001-12-29 229.00          1 I0000000783
9  C0100000199 2001-06-28 139.00          1 I0000000537
10 C0100000343 2001-09-07  99.00          1 I0000000532

I want to convert it in a structure like this,

CardID, FirstPurchaseDate, LastPurchaseDate, NumberOrders, NumberSKUs, TotalAmounts

Where each row of CardID in the new table is unique. How can I make this possible?

Based on the table above, I expected an output like this

> Ex
       CardID FirstPurchaseDate LastPurchaseDate NumberOrders NumberSKUs TotalAmounts
1 C0100000111        2001-02-20       2001-07-19            6          6       882.97
2 C0100000199        2001-06-28       2001-12-29            3          2       597.00
3 C0100000343        2001-09-07       2001-09-07            1          1        99.00
Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
hexnicam
  • 3
  • 3
  • For more options see http://stackoverflow.com/questions/9847054/how-to-get-summary-statistics-by-group – m-dz Sep 19 '16 at 08:54

2 Answers2

2

We can use summarise after grouping by 'CardID' with dplyr

library(dplyr) 
df1 %>% 
    group_by(CardID) %>% 
    summarise(FirstPurchaseDate = first(Date),
              LastPurchaseDate = last(Date),
              NumberOrders = n(), 
              NumberSKUs= n_distinct(ItemCode),
              TotalAmount = sum(Amount) )
akrun
  • 874,273
  • 37
  • 540
  • 662
1

A data.table version below:

library(data.table)

dt <- data.frame(
  CardID = c("C0100000111", "C0100000111", "C0100000111", "C0100000111", "C0100000111", "C0100000111", "C0100000199", "C0100000199", "C0100000199", "C0100000343"),
  Date = as.Date(c("2001-07-19", "2001-02-20", "2001-04-27", "2001-02-20", "2001-05-17", "2001-05-19", "2001-08-20", "2001-12-29", "2001-06-28", "2001-09-07")),
  Amount = c(449, 9.99, 49.99, 69, 299, 5.99, 229, 229, 139, 99),
  ItemNumber = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
  ItemCode = c("I0000000808", "I0000000622", "I0000000284", "I0000000488", "I0000000595", "I0000000078", "I0000000783", "I0000000783", "I0000000537", "I0000000532")
)

# Convert to data.table
setDT(dt)

dt[, .(
  FirstPurchaseDate = min(Date),
  LastPurchaseDate = max(Date),
  NumberOrders = .N,
  NumberSKUs = length(unique(ItemCode)),
  TotalAmounts = sum(Amount)
), by = CardID]

Result:

        CardID FirstPurchaseDate LastPurchaseDate NumberOrders NumberSKUs TotalAmounts
1: C0100000111        2001-02-20       2001-07-19            6          6       882.97
2: C0100000199        2001-06-28       2001-12-29            3          2       597.00
3: C0100000343        2001-09-07       2001-09-07            1          1        99.00

Edit: Akrun was first, so go for his answer! Leaving this one just for a data.table reference. I should start using dplyr more...

m-dz
  • 2,342
  • 17
  • 29