
Suppose I have key-value pairs, each comprising a userId and a list of 0/1 integers indicating whether the user has each attribute:

userId     hasAttrA  hasAttrB  hasAttrC
joe               1         0         1
jack              1         1         0
jane              0         0         1
jeri              1         0         0

In Scala code, the data structure looks like:

var data = Array(("joe",  List(1, 0, 1)),
                 ("jack", List(1, 1, 0)),
                 ("jane", List(0, 0, 1)),
                 ("jeri", List(1, 0, 0)))

I would like to compute the fraction of all users that have each attribute. However, this computation requires summing each attribute column across all the keys, which I don't know how to do. So I would like to calculate:

  1. How many users are there?

data.size // 4

  2. What fraction of users has attribute A?

Should be: sum(hasAttrA) / data.size = 3/4 = 0.75

  3. What fraction of users has attribute B?

Should be: sum(hasAttrB) / data.size = 1/4 = 0.25

etc.

How can I compute the sums across all the keys, and how can I compute the final percentages?

EDIT 2/24/2016:

I can manually find the sums of individual columns like so:

var sumAttributeA = data.map{ case(id, attributeList) => attributeList(0)}.sum
var sumAttributeB = data.map{ case(id, attributeList) => attributeList(1)}.sum
var sumAttributeC = data.map{ case(id, attributeList) => attributeList(2)}.sum

var fractionAttributeA = sumAttributeA.toDouble/data.size
//fractionAttributeA: Double = 0.75
var fractionAttributeB = sumAttributeB.toDouble/data.size
//fractionAttributeB: Double = 0.25
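
To avoid one pass over the data per column, the sums can also be computed for all columns at once. A minimal plain-Scala sketch against the Array above (the sums/fractions names are just illustrative, and it assumes every attribute list has the same length):

// Transpose rows into columns, then sum each column in a single pass.
val sums = data.toList.map(_._2).transpose.map(_.sum)
// sums: List[Int] = List(3, 1, 2)
val fractions = sums.map(_.toDouble / data.size)
// fractions: List[Double] = List(0.75, 0.25, 0.5)
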
– stackoverflowuser2010

1 Answer


One possible solution:

import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.mllib.linalg.Vectors

val stats = sc.parallelize(data)
  .values                                               // drop the userId keys
  .map(xs => Vectors.dense(xs.toArray.map(_.toDouble))) // List[Int] -> MLlib Vector
  .aggregate(new MultivariateOnlineSummarizer)(_ add _, _ merge _)

(stats.count, stats.mean)
// (Long, org.apache.spark.mllib.linalg.Vector) = (4,[0.75,0.25,0.5])
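
Since each attribute is 0 or 1, the per-column mean reported by the summarizer is exactly the fraction of users with that attribute, so stats.mean already holds the final answer. A small sketch of unpacking it (the fracA/fracB/fracC names are just illustrative):

// Each mean of a 0/1 column is the fraction of users with that attribute.
val Array(fracA, fracB, fracC) = stats.mean.toArray
// fracA: Double = 0.75, fracB: Double = 0.25, fracC: Double = 0.5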

You can also apply a similar operation manually:

val (total, sums) = sc.parallelize(data).values
  .map(vs => (1L, vs.map(_.toLong)))   // pair every row with a count of 1
  .reduce { case ((cnt1, vs1), (cnt2, vs2)) =>
    (cnt1 + cnt2, vs1.zip(vs2).map { case (x, y) => x + y })
  }

sums.map(_.toDouble / total)
// List[Double] = List(0.75, 0.25, 0.5)

but it will have much worse numerical properties: summing first and dividing once accumulates floating-point error that the summarizer's numerically stable online mean update avoids. With 0/1 integers the results are identical, but the difference matters for general numeric columns.
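
For comparison, a sketch of the same single-pass count-and-sum written with aggregate instead of reduce, which skips the intermediate map that pairs every row with a 1L; the zero value here is hardcoded to the three attribute columns of the sample data, and the numerical caveat above still applies:

// Neutral element: zero count, zero sum per attribute column.
val zero = (0L, List(0L, 0L, 0L))
val (n, colSums) = sc.parallelize(data).values.aggregate(zero)(
  (acc, vs) => (acc._1 + 1L, acc._2.zip(vs).map { case (s, v) => s + v }),
  (a, b) => (a._1 + b._1, a._2.zip(b._2).map { case (x, y) => x + y })
)

colSums.map(_.toDouble / n)
// List[Double] = List(0.75, 0.25, 0.5)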

– zero323
  • I think that is the best answer you can give for this question. Nevertheless, haven't you answered a similar question already? – eliasah Feb 24 '16 at 07:32
  • @eliasah At some point they all look the same :) Seriously though, I couldn't find anything that could be safely used as a dupe. I've been thinking about this one: http://stackoverflow.com/q/35405314/1560062 Do you have the same in mind? – zero323 Feb 24 '16 at 11:56
  • Yes, exactly that was the one I was thinking about, but I also agree with you; at some point they'll all look the same. – eliasah Feb 24 '16 at 11:59
  • If you think it is, I can remove this one. I thought about adding some other options, but to be honest this is the only reasonable option. I mean, it is easy to code a naive solution by hand, but any robust option requires rewriting the online mean from scratch. – zero323 Feb 24 '16 at 12:20
  • I think it's OK to keep this one as a generalization of that answer. – eliasah Feb 24 '16 at 12:35
  • @zero323: I found my own candidate solution and amended my original post. Can you please comment? – stackoverflowuser2010 Feb 24 '16 at 23:07
  • @stackoverflowuser2010 It should work just fine although starting a job per value is rather wasteful. – zero323 Feb 25 '16 at 12:26