
I have data from a set of discrete choice tasks in which each task presented two alternatives described by three attributes (brand, price, performance). From this data I have taken 1000 draws from the posterior distribution, which I'll use to calculate utility and, eventually, preference share for each individual and each draw.

Price and performance were tested at the discrete levels (-.2, 0, .2) and (-.25, 0, .25) respectively. I need to be able to interpolate utility between the attribute levels tested. Let's assume for now that a linear interpolation is a reasonable thing to do statistically. In other words, what is the most efficient way to interpolate the utility for price if I wanted to test a scenario with a price 10% lower? I have not been able to think of a slick or efficient way to do the interpolation, so I've resorted to an apply-style approach using the mdply() function from plyr.

Here's some data and my current approach:

library(plyr)
#draws from posterior, 2 respondents, 2 draws each
draw <- list(structure(c(-2.403, -2.295, 3.198, 1.378, 0.159, 1.531, 
1.567, -1.716, -4.244, 0.819, -1.121, -0.622, 1.519, 1.731, -1.779, 
2.84), .Dim = c(2L, 8L), .Dimnames = list(NULL, c("brand_1", 
"brand_2", "price_1", "price_2", "price_3", "perf_1", "perf_2", 
"perf_3"))), structure(c(-4.794, -2.147, -1.912, 0.241, 0.084, 
0.31, 0.093, -0.249, 0.054, -0.042, 0.248, -0.737, -1.775, 1.803, 
0.73, -0.505), .Dim = c(2L, 8L), .Dimnames = list(NULL, c("brand_1", 
"brand_2", "price_1", "price_2", "price_3", "perf_1", "perf_2", 
"perf_3")))) 

#define attributes for each brand: brand constant, price, performance
b1 <- c(1, .15, .25)
b2 <- c(2, .1, .2)

#Create data.frame out of attribute lists. Will use mdply to go through each combination
interpolateCombos <- data.frame(xout = c(b1,b2), 
                                atts = rep(c("Brand", "Price", "Performance"), 2),
                                i = rep(1:2, each = 3),
                                stringsAsFactors = FALSE)

#Find point along line. Tried approx(), but too slow

findInt <- function(x1,x2,y1,y2,reqx) {
  range <- x2 - x1
  diff <- reqx - x1
  out <- y1 + ((y2 - y1)/range) * diff
  return(out)
}
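
For example, with hypothetical utilities of -1.0 at the price break 0 and 0.5 at the break .2, the utility at a price of .15 lands three quarters of the way along that segment:

#hypothetical values, purely to show the arithmetic
findInt(x1 = 0, x2 = .2, y1 = -1.0, y2 = 0.5, reqx = .15)
#[1] 0.125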


calcInterpolate <- function(xout, atts, i){
  if (atts == "Brand") {
    breaks <- 1:2
    cols <- 1:2
  } else if (atts == "Price"){
    breaks <- c(-.2, 0, .2)
    cols <- 3:5
  } else {
    breaks <- c(-.25, 0, .25)
    cols <- 6:8
  }

  utils <- draw[[i]][, cols]

  if (atts == "Brand" | xout %in% breaks){ #Brand can't be interpolated; exact matches to a break need no interpolation
    out <- data.frame(out = utils[, match(xout, breaks)])
  } else { #Must interpolate between the two nearest breaks
    mi <- max(which(breaks <= xout)) #nearest break at or below xout
    ma <- min(which(breaks >= xout)) #nearest break at or above xout
    out <- data.frame(out = findInt(breaks[mi], breaks[ma], utils[, mi], utils[, ma], xout))
  }
  out$draw <- 1:nrow(utils)
  return(out)
}
out <- mdply(interpolateCombos, calcInterpolate)
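
Since the scenario value is fixed, the interpolation inside calcInterpolate() reduces to a single weighted sum of two utility columns, which can be applied to a whole draw matrix at once. Here is a minimal sketch of that reduction, assuming interpolation between the two nearest breaks; interpCols() and its arguments are illustrative, not part of the question's code:

#one weighted sum of two columns does the interpolation for every draw at once
interpCols <- function(mat, cols, breaks, xout) {
  lo <- max(which(breaks <= xout)) #nearest break at or below xout
  hi <- min(which(breaks >= xout)) #nearest break at or above xout
  if (lo == hi) return(mat[, cols[lo]]) #exact match to a tested level
  w <- (xout - breaks[lo]) / (breaks[hi] - breaks[lo])
  (1 - w) * mat[, cols[lo]] + w * mat[, cols[hi]]
}

#price utility for both draws of respondent 1 at a price of .15
interpCols(draw[[1]], 3:5, c(-.2, 0, .2), .15)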

To provide context on what I'm trying to accomplish without interpolating attribute levels, here's how I'd do that. Note the products are now defined in terms of column references within each attribute block: p1 and p2 are the product definitions, u1 and u2 the utilities, and s1 and s2 the preference shares for each draw.

Any nudge in the right direction would be appreciated. My real case has 10 products with 8 attributes each. At 10k draws, my 8 GB of RAM is crapping out, but I can't get out of this rabbit hole I've dug myself.

p1 <- c(1,2,1)
p2 <- c(2,1,2)


FUN <- function(x, p1, p2) {
  bases <- c(0, 2, 5) #column offsets of the brand, price, and performance blocks

  u1 <- rowSums(x[, bases + p1]) #total utility of each product for each draw
  u2 <- rowSums(x[, bases + p2])
  sumExp <- exp(u1) + exp(u2)
  s1 <- exp(u1) / sumExp #logit rule for preference share
  s2 <- exp(u2) / sumExp
  return(cbind(s1, s2))
}
lapply(draw, FUN, p1 = p1, p2 = p2)

[[1]]
                s1        s2
[1,] 0.00107646039 0.9989235
[2,] 0.00009391749 0.9999061

[[2]]
              s1        s2
[1,] 0.299432858 0.7005671
[2,] 0.004123175 0.9958768
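
For illustration only, combining the two pieces for a product at an off-grid price would just swap the looked-up price column for an interpolated one; this hypothetical snippet reuses the interpCols() sketch from above:

#hypothetical: total utility for respondent 1's draws with brand 1, price .15, performance .25
u1 <- draw[[1]][, "brand_1"] +
  interpCols(draw[[1]], 3:5, c(-.2, 0, .2), .15) +
  interpCols(draw[[1]], 6:8, c(-.25, 0, .25), .25)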
Chase
  • What do you wish to improve? Speed? Memory? Statistical convergence? (All of the above? :)) I can sort of grok your code, but I don't understand the statistical problem you're trying to solve. Humor me: why wouldn't this simply be ordinary least squares regression? `lm()` is okay with categorical factors, and (unless I'm mistaken) ordered factors as well. – Iterator Oct 26 '11 at 21:44
  • @Iterator - I'll take option 3 for 500 please! The main issues I am running into are both speed and memory based. The statistical problem, if I were to even classify it as such, is simply needing to perform a linear interpolation for 80+ attributes for a matrix containing somewhere around 0.5M rows. Two easy changes I've made proved quite significant. Namely: 1) moving to `mapply()`from `mdply()`. 2) Changing all data.frames to matrices. I'm still not convinced that the whole approach is as efficient computationally as it could be, but it's sufficient for now. When I get to a stopping point – Chase Oct 27 '11 at 03:23
  • I'll update / answer my own question for posterity. I gather you do a fair amount of related work (with the namesake and all), so can talk in more detail offline if it's of interest! – Chase Oct 27 '11 at 03:25
  • Did you use `mdply(.parallel = TRUE)` with an appropriate parallel (multicore) backend registered? – Iterator Oct 27 '11 at 03:41
  • Did you try using the compiler()? It is often, but not always, beneficial. Another idea is that `data.table` could be even better than matrices, but it depends on what's going on. – Iterator Oct 27 '11 at 03:47
  • @Iterator - both good tips, yes on the parallel though I didn't have great results (probably more user error than anything else). No on `compiler`, though will try that in the AM. I am using `data.table()` for the aggregation tasks from there onward, but couldn't wrap my head around using it for this part of it. – Chase Oct 27 '11 at 04:18
  • Interesting question. A `data.table` approach might involve changing `draw` to a 3 column `data.table` (`i,atts,value`) keyed by `i,atts`. That's instead of `draw[[i]][, cols]` which'll be copying to get `draw[[i]]` then copying again to get `[,cols]`. Possibly, if `value` can be fixed precision at 3dp you could `*1000`, store as integer, and include that in the key, using `roll=TRUE` instead of the `min(which(...))` and `max(which(...)`. You might need a `list` column in `i` for the `breaks` and use join inherited scope (i.e. a non join column in `i`). Ok? ;) – Matt Dowle Oct 27 '11 at 14:40
  • @Matthew - well when you say it like that, it makes perfect sense! :) Can you clarify what you mean by the "can be fixed precision at 3dp..." bit of that? I think I understood the rest of what you typed. I am using `data.table()` for the rest of the aggregation, which is working quite quickly. Thank you for all of your work on this project, it is very much appreciated. – Chase Oct 27 '11 at 23:56
  • Sure. By that I meant if you can truncate your numeric data to a fixed number of decimal digits without losing accuracy. 3 was just because you showed 3 in your example data, but it could be any fixed number n, so you can multiply by 10^n and truncate to integer (to force it into a key). I am working on allowing numeric key columns, though that's for a future version. Adding @Iterator here in case he's interested. – Matt Dowle Oct 28 '11 at 08:54
  • @Iterator You mentioned in chat the other day that `data.table` seems to do assignment by reference. Just to check you've found `?copy` and the example there. Also, we're having a discussion on [datatable-help](http://r.789695.n4.nabble.com/datatable-help-f2315188.html) about shallow and deep copies of `:=` if you're interested. – Matt Dowle Oct 28 '11 at 08:59
  • this isn't the answer by any means, but I asked a similar question: http://stackoverflow.com/questions/7324111/calculate-product-variants-based-on-option-groups-and-options maybe you'll find it helpful in leading you to an answer – Francis Yaconiello Jan 06 '12 at 20:56
  • @Iterator - yeah I got a workable solution put together. I'll post what I ended up with when I get back to my work computer. I'm sure it can be refined further, but nothing that a few hours of amazon EC2 time couldn't fix. Thanks for your helpful suggestions and prods earlier! Cheers, -C – Chase Jan 16 '12 at 07:57
  • @Chase: Want to win a bounty on your own question? Give us the answer that 48 people are eager to see, and for which 6 have chosen to follow this question. :) – Iterator Feb 08 '12 at 17:11
  • Did you intend to interpolate between the overall max and min value, or between the two nearest values? For example, when price is 0.15 it looks like you are interpolating between the values at -0.2 and 0.2. But I would have thought it should be between 0 and 0.2. So instead of -3.69 for draw 1 of respondent 1, should it be -2.79? – dnlbrky Jun 18 '13 at 06:24
  • @dnlbrky - this code is pushing two years old at this point! However, you are correct in my intent to interpolate between the two nearest values, not the overall. I have identified a solution which does what I want and is reasonably fast, so should probably answer my own question at this point. – Chase Jun 18 '13 at 12:58
  • Yeah, I thought it sounded like an interesting problem and it had a lot of interest (at least two years ago) without a true answer. So I was working on a solution using `data.table` and wanted to understand the problem better. If nothing else my initial tinkering has helped me get more comfortable with creating something like the `LAG` and `LEAD` functions from SQL using `data.table`. Still don't have a final solution for this problem, but will post it if/when I do. – dnlbrky Jun 18 '13 at 13:18
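
A rough sketch of the rolling-join idea Matt Dowle describes above, assuming a current version of data.table (which allows negative roll for rolling backward); the object and column names here are illustrative, not from the question:

library(data.table)

#long format: one row per respondent x draw x price break
priceLong <- rbindlist(lapply(seq_along(draw), function(i) {
  m <- draw[[i]]
  data.table(resp = i,
             drw  = rep(seq_len(nrow(m)), each = 3),
             brk  = rep(c(-.2, 0, .2), times = nrow(m)),
             util = as.vector(t(m[, c("price_1", "price_2", "price_3")])))
}))
priceLong[, brkVal := brk] #copy of the break value so it survives the join
setkey(priceLong, resp, drw, brk)

scen <- CJ(resp = 1:2, drw = 1:2, brk = .15)  #scenario price for every respondent/draw
lo <- priceLong[scen, roll =  Inf]            #break at or below .15 (rolls forward)
hi <- priceLong[scen, roll = -Inf]            #break at or above .15 (rolls backward)

w <- (.15 - lo$brkVal) / (hi$brkVal - lo$brkVal) #would need a guard when .15 hits a break exactly
interp <- (1 - w) * lo$util + w * hi$util        #interpolated price utility per respondent/draw

The compiler suggestion from the comments is a one-liner by comparison, e.g. calcInterpolate <- compiler::cmpfun(calcInterpolate).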

1 Answer


A somewhat unconventional way to get what you desire is to build a global ranking of all your products using your 10k draws.

Use each draw as a source of binary contests between the 10 products, and sum the results of these contests over all draws.

This will give you a final "leader-board" for your 10 products. From this you have relative utility across all consumers, or you can assign an absolute value based on the number of wins (and optionally, the "strength" of the alternative in each contest) for each product.
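
To make the contest idea concrete, here's a minimal sketch under the assumption that per-draw total utilities have already been stacked into one draws-by-products matrix; utilMat and leaderBoard() are made-up names, not from the question:

#tally pairwise wins across draws; utilMat is a draws x products utility matrix
leaderBoard <- function(utilMat) {
  nProd <- ncol(utilMat)
  wins  <- setNames(numeric(nProd), colnames(utilMat))
  for (a in seq_len(nProd - 1)) {
    for (b in (a + 1):nProd) {
      aWins   <- sum(utilMat[, a] > utilMat[, b]) #draws where product a beats product b
      wins[a] <- wins[a] + aWins
      wins[b] <- wins[b] + (nrow(utilMat) - aWins) #ties counted for b in this sketch
    }
  }
  sort(wins, decreasing = TRUE) #the "leader-board": products ranked by total wins
}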

When you want to test a new product with a different attribute profile, find its sparse(st) representation as a weighted sum of the other sample products; you can then run the contest again with the win probabilities weighted by the contribution weights of the component products.

The advantage of this is that simulating the contest is efficient, and the global ranking combined with representing new products as sparse vector sums of existing data allows much pondering and interpretation of the results, which is useful when you're considering strategies to beat the competition's product attributes.

To find a sparse (descriptive) representation of your new product, solve A'x = y, where A is your matrix of existing products (rows are their attribute vectors), y is the new product's attribute vector, and x is the vector of contribution weights from your existing products. You want to minimize the number of non-zero entries in x. Check out the Donoho article on the fast homotopy method (like a power iteration) for solving the l0-l1 minimization quickly to find sparse representations.
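
A minimal sketch of that sparse-weights step, using the lars package's lasso path as a stand-in for the homotopy solver mentioned above; A, y and the chosen path step are made up purely for illustration:

library(lars)

set.seed(1)
A <- matrix(rnorm(10 * 8), nrow = 10)   #10 existing products x 8 attributes (made up)
y <- 0.6 * A[1, ] + 0.4 * A[4, ]        #new product: a mix of products 1 and 4

fit  <- lars(t(A), y, type = "lasso", intercept = FALSE, normalize = FALSE)
cf   <- coef(fit)                       #weights at each step of the l1 path
step <- which(rowSums(cf != 0) >= 2)[1] #earliest step with two active products
round(cf[step, ], 3)                    #sparse weights; products 1 and 4 should dominate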

When you have this (or a weighted average of sparse representations) you can reason usefully about the performance of your new product based on the model set up by your existing preference draws.

The advantage of a sparse representation is that it allows you to reason usefully; plus, the more features or products you have, the better, since the new product is more likely to be sparsely representable by them. So you can scale to big matrices and still get useful results with a quick algorithm.

Cris Stringfellow