3

I have a table that has two columns

Order | CustomerID

 1. A | C1 
 2. B | C1 
 3. C | C1 
 4. D | C2 
 5. B | C3 
 6. C | C3
 7. D | C4

Its a long table. I want an output that shows

C1 | C3 | 2  #Customer C1 and Customer C3 share 2 orders (i.e. orders, B & C) 
C1 | C2 | 0   #Customer C1 and Customer C2 share 0 orders 
C2 | C4 | 1   #Customer C2 and Customer C4 share 1 orders (i.e. order, D)
C2 | C3 | 0  Customer C2 and Customer C3 share 0 orders  
IRTFM
  • 258,963
  • 21
  • 364
  • 487
user2162611
  • 146
  • 3
  • 13
  • so if there are n customers do you have to show the counts for nC2 combinations? and if you plan to do this with a sql query, please tag the dbms being used. – Vamsi Prabhala Mar 01 '17 at 20:42
  • 1
    Join to itself on order, group by customerID1 and customerID2, and count. Plenty of posts about [merging/joining](http://stackoverflow.com/questions/1299871) and about [group by counting](http://stackoverflow.com/questions/9809166/count-number-of-rows-within-each-group). – zx8754 Mar 01 '17 at 20:51
  • Its ok to not show combinations with 0 shared orders.I wanted to illustrate – user2162611 Mar 01 '17 at 20:54
  • 1
    For the [r] tag, possible duplicate of http://stackoverflow.com/questions/28742825/count-occurence-of-values-for-every-possible-pair – zx8754 Mar 01 '17 at 21:00

4 Answers4

5
select 
    a.CustomerId
  , b.CustomerId
  , sum(case when a.[Order] = b.[Order] then 1 else 0 end) as SharedOrders
from t as a
  inner join t as b
    on a.CustomerId < b.CustomerId
group by a.CustomerId, b.CustomerId

test setup: http://rextester.com/ISSCL35174

returns:

+------------+------------+--------------+
| CustomerId | CustomerId | SharedOrders |
+------------+------------+--------------+
| C1         | C2         |            0 |
| C1         | C3         |            2 |
| C2         | C3         |            0 |
| C1         | C4         |            0 |
| C2         | C4         |            1 |
| C3         | C4         |            0 |
+------------+------------+--------------+

To just return shared orders:

select a.CustomerId
     , b.CustomerId
     , count(*) as SharedOrders
from t as a
  inner join t as b
    on a.CustomerId < b.CustomerId
   and a.[Order] = b.[Order]
group by a.CustomerId, b.CustomerId

returns:

+------------+------------+--------------+
| CustomerId | CustomerId | SharedOrders |
+------------+------------+--------------+
| C1         | C3         |            2 |
| C2         | C4         |            1 |
+------------+------------+--------------+
SqlZim
  • 37,248
  • 6
  • 41
  • 59
4

Here is a base R method using table, crossprod, combn, and matrix subsetting.

# get counts of customer IDs
myMat <- crossprod(with(df, table(Order, CustomerID)))
myMat
          CustomerID
CustomerID C1 C2 C3 C4
        C1  3  0  2  0
        C2  0  1  0  1
        C3  2  0  2  0
        C4  0  1  0  1

Note here that the diagonal is the count of total orders that each customer made and the (symmetric) off diagonals are the counts of orders that each customer shared.

# get all customer pairs
customers <- t(combn(rownames(myMat), 2))

# use matrix subsetting to pull out order counts and cbind.data.frame to put it together
cbind.data.frame(customers, myMat[customers])
   1  2 myMat[customers]
1 C1 C2                0
2 C1 C3                2
3 C1 C4                0
4 C2 C3                0
5 C2 C4                1
6 C3 C4                0

You could use wrap this in setNames to add names if desired to give specific variable names

setNames(cbind.data.frame(customers, myMat[customers]), c("c1", "c2", "counts"))

data

df <- 
structure(list(Order = c("A", "B", "C", "D", "B", "C", "D"), 
    CustomerID = c("C1", "C1", "C1", "C2", "C3", "C3", "C4")), .Names = c("Order", 
"CustomerID"), class = "data.frame", row.names = c(NA, -7L))
lmo
  • 37,904
  • 9
  • 56
  • 69
1

A SQL server demo (but the code is generic):

; with data as (select 'A' as [Order], 'C1' as CustomerID 
                union all 
                select 'B', 'C1'
                union all 
                select 'C', 'C1'
                union all 
                select 'D', 'C2'
                union all 
                select 'B', 'C3'
                union all 
                select 'C', 'C3'
                union all 
                select 'D', 'C4'
        )
select c1, c2, count(*) from (
select x.[Order], x.CustomerID c1, y.CustomerID c2
from data x join data y on x.[Order] = y.[Order] and x.CustomerID < y.CustomerID
) temp
group by c1, c2

This takes into account only the pairs sharing at least one order. I think returning pairs not sharing any orders would be a waste of resources.

Giorgos Altanis
  • 2,742
  • 1
  • 13
  • 14
1

I would use a cross join to get all pairs of customers, and then left joins to bring in the order. The final step is aggregation:

select c1.CustomerId, c2.CustomerId, count(t2.Order) as inCommon
from (select distinct CustomerID from t) c1 cross join
     (select distinct CustomerID from t) c2 left join
     t t1
     on t1.CustomerId = c1.CustomerId left join
     t t2
     on t2.CustomerId = c2.CustomerId and
        t2.Order = t1.Order
where c1.CustomerId < c2.CustomerId
group by c1.CustomerId, c2.CustomerId;

This process is a bit tricky because you want pairs that have no orders in common.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786