
I'm using the .reduce method to iterate through an array of objects in order to return the index of the object that best fits certain conditions. My array has around 30,000 elements right now, and I'm aiming for well over a million. The trouble is that iterating through the array with .reduce takes forever: almost 4 seconds at the moment, and that will only get worse at my projected 1 million elements. The array is compact, and I'm not connected to a database or server. Here is my code:

var startMatchMaking = function () {
    var loopCounter = 0;
    var length = personArray.length;
    do {
        var manCounter = 0;
        loopCounter++;
        for (var i = length; i--;) {
            if (!personArray[i].isSingle && personArray[i].sex === "Male" &&
                    personArray[i].isAvailable === true) {
                manCounter++;
                var num = normalRandomScaled(2.1, 12.44);

                // scan the whole array; reduce returns the index of the
                // last element that satisfies the matching conditions
                var result = personArray.reduce(function (p, c, k, a) {
                    return c.sex !== personArray[i].sex &&
                        !c.isSingle && c.isAvailable === true &&
                        c.age <= (personArray[i].age + num) &&
                        c.age >= (personArray[i].age - num) ? k : p;
                }, 0);

                // reduce falls back to index 0 when nothing matches,
                // so re-check that the returned index is a valid match
                result = !personArray[result].isSingle &&
                    personArray[result].sex !== personArray[i].sex &&
                    personArray[result].age <= (personArray[i].age + num) &&
                    personArray[result].age >= (personArray[i].age - num) ? result : -1;

                if (result >= 0) {
                    householdArray.push(new Household(personArray[i], personArray[result]));
                    personArray[result].isAvailable = false;
                    personArray[i].isAvailable = false;
                }
            }
        }
        document.write("<br>Mancounter is: " + manCounter +
            " loopCounter is: " + loopCounter + " households: " + householdArray.length);
    } while (manCounter > 0 && loopCounter <= 5);
};

startMatchMaking();

CONTEXT: I'm trying to develop a standalone application that runs an agent-based demographics simulation. personArray essentially contains 30,000 human beings. The code above is part of the initial setup of the population. Persons were previously created and pushed to the array. Each Person object has firstName, lastName, sex, age and isSingle properties, each assigned a random value. At this stage in the program, I need to take the Persons who are meant to be not single and match them with a suitable mate of the opposite sex and a compatible age to form households.

How can I optimize this to increase performance significantly? I'm open to small changes or to a completely different approach that produces the same result.

snowfrogdev
  • Is your array compact? I.e., did you at any point `delete` an element from it? Also, you can capture `personArray[i]` in a local variable to (very probably) avoid `3 * personArray.length` array accesses. – BeyelerStudios Aug 13 '16 at 14:32
  • What is `i` in that code? What is the context of that `.reduce()` call? There's really no way (in a browser running on a modern PC) that iterating through a 30,000-element array would take 4 seconds. – Pointy Aug 13 '16 at 14:34
  • If you're invoking that code inside another loop, then *that's* your problem. – Pointy Aug 13 '16 at 14:34
  • Are you using reduce correctly? It seems like `k` is being used in a weird place, and where does `i` come from? – Sean Wang Aug 13 '16 at 14:35
  • Whatever you're doing with that much data should probably be done in a database, not with JS. Especially since your code looks much like a query, not a reduction. – Bergi Aug 13 '16 at 14:35
  • If you're not connecting to a server-side database, you may want to check https://www.npmjs.com/package/sqlite3 – Aprillion Aug 13 '16 at 14:41
  • Edited the post to answer some of your questions. – snowfrogdev Aug 13 '16 at 14:45
  • OK, there you go. You're iterating through `personArray`, and on each iteration you iterate *again* via that `.reduce()` call. That means the `.reduce()` callback will be invoked nine hundred million times. Nothing you do to speed up the `.reduce()` callback will help. – Pointy Aug 13 '16 at 14:48
  • Basically, what @Pointy is saying is: at the moment you're using brute force; what you want is a better clustering strategy. Step one would be to split your data into categories by `sex, single, available` and only search the right category/ies, removing the overhead of checking the wrong categories `personArray.length` times. Step two could be splitting those categories into age ranges, so you only search the ranges that intersect the age range you're interested in. – BeyelerStudios Aug 13 '16 at 15:03
  • Yeah, figured. I was kind of hoping that cutting down on the `.reduce` time with simple optimizations would help: shaving a few milliseconds, times nine hundred million, would make a difference. I've added some context to my post; maybe someone will figure out a completely different way to get what I want. – snowfrogdev Aug 13 '16 at 15:09
  • Replacing `.reduce()` with a `for` loop would make the code messier, but it would cut out all of the function calls, so that could speed things up. – nnnnnn Aug 13 '16 at 15:17
  • @BeyelerStudios I think I understand what you're suggesting, but I've never really done something like this before. Could you expand a little on your idea and how I could implement it, maybe with some pseudo-code? You could post it as an official answer. Thanks. – snowfrogdev Aug 13 '16 at 15:22
  • @neoflash I'm working on it ;) – BeyelerStudios Aug 13 '16 at 15:35

3 Answers


As already mentioned in the comments, you call reduce, which iterates over all elements, inside a loop that also iterates over all elements. This gives the algorithm quadratic complexity: if you double the number of persons, the runtime is multiplied by 4. Processing millions of people this way is therefore completely infeasible.

It seems to me that the inner iteration over all elements is not necessary. You can replace reduce with an ordinary loop and stop iterating as soon as a match is found. The current solution takes the last match found; is there anything that makes the last one better than the first, or am I missing something? You could also pick indices at random when searching for a match and stop when one is found. This change requires little effort, and I expect it to make a big difference except for very young and very old people (outliers).
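A minimal sketch of the early-exit idea, assuming the same Person shape as in the question (the name findFirstMatch is just for illustration):

// Plain loop instead of reduce: stop scanning as soon as one match is found.
function findFirstMatch(personArray, man, num) {
    for (var k = 0; k < personArray.length; k++) {
        var c = personArray[k];
        if (c.sex !== man.sex &&
                !c.isSingle && c.isAvailable &&
                c.age <= man.age + num &&
                c.age >= man.age - num) {
            return k; // early abort: the rest of the array is never touched
        }
    }
    return -1; // no match found
}

In the best case this looks at a handful of elements instead of all 30,000, though the worst case (no match) still scans the whole array.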

A solution that requires more changes is to map people by their properties, as already suggested in the comments, so that you can do something like matchCandidates = people[oppositeSex][ageWithSomeRandomness]. Have a look at this post for more information about possible implementations of maps and hashtables in JavaScript.
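A minimal sketch of such a lookup structure, assuming integer ages; buildIndex and candidatesFor are illustrative names:

// Index available non-singles by sex and age, so finding candidates becomes
// a few small array lookups instead of a scan over the whole population.
function buildIndex(personArray) {
    var index = { Male: {}, Female: {} };
    personArray.forEach(function (p) {
        if (!p.isSingle && p.isAvailable) {
            (index[p.sex][p.age] = index[p.sex][p.age] || []).push(p);
        }
    });
    return index;
}

// All candidates of the opposite sex within +/- num years.
function candidatesFor(index, person, num) {
    var oppositeSex = person.sex === "Male" ? "Female" : "Male";
    var result = [];
    for (var age = Math.ceil(person.age - num); age <= person.age + num; age++) {
        if (index[oppositeSex][age]) {
            result = result.concat(index[oppositeSex][age]);
        }
    }
    return result;
}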

An additional improvement might be achieved by filtering the people at the beginning so that no singles are included, i.e. copying people that are not single into a new array and only accessing the new array in the algorithm.
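For example, keeping the original property names:

// Copy out only the people who are meant to be matched at all.
var toBeMatched = personArray.filter(function (p) { return !p.isSingle; });

Every later pass then touches only this smaller array.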

If your code runs in a browser, you can use web workers to keep the matching from freezing the page.
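A minimal sketch of the wiring; the file name matchWorker.js and the function runMatchMaking are placeholders for your own code:

// main.js: hand the population to a background thread so the page stays responsive.
var worker = new Worker('matchWorker.js');
worker.onmessage = function (e) {
    console.log('Matched ' + e.data.length + ' couples');
};
// Plain data properties survive the structured clone; methods do not.
worker.postMessage(personArray);

// matchWorker.js: do the heavy matching off the main thread.
self.onmessage = function (e) {
    var households = runMatchMaking(e.data); // your matching algorithm
    self.postMessage(households);
};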

mm759

I think you'll need some pre-processing in order to speed this up.

For instance:

  • split population into men and women
  • sort men by age
  • iterate over the sorted array of men and compute a new set of matching women only when a new age is encountered
  • just pick women from the current set while it's up-to-date and not empty

EDIT: We can optionally optimize the number of matches by sorting the set of women by age difference, so that couples with a small age difference are created first.

Below is some example code.

var personArray = [];

// create test population
for(var n = 0; n < 30000; n++) {
  personArray.push({
    isSingle: Math.random() < 0.5,
    age: Math.round(18 + Math.random() * 80),
    sex: Math.random() < 0.5 ? 'M' : 'F',
    isAvailable: true
  });
}

var num = 7, // instead of num = normalRandomScaled(2.1, 12.44)
    sex = [ [], [] ],
    curAge = -1, subset,
    houseHold = [],
    ts = performance.now();

// split population into men & women
personArray.forEach(function(p) {
  sex[p.sex == 'M' ? 0 : 1].push(p);
});

// sort men by age
sex[0].sort(function(a, b) { return a.age - b.age; });

// iterate on men
sex[0].forEach(function(m) {
  if(m.age != curAge) {
    // create set of matching women for this age
    subset = sex[1].filter(function(w) {
      return w.isAvailable && w.isSingle && Math.abs(m.age - w.age) <= num;
    });
    // sort by age difference, so that women with
    // a small age difference are picked first
    subset.sort(function(a, b) {
      return Math.abs(m.age - b.age) - Math.abs(m.age - a.age);
    });
    curAge = m.age;
  }
  if(m.isSingle && subset.length) {
    // pick woman from set
    var w = subset.pop();
    m.isAvailable = false; // (not really necessary)
    w.isAvailable = false;
    houseHold.push([ m, w ]);
  }
});

console.log(
  'Found ' + houseHold.length + ' matches ' +
  'in ' + Math.round(performance.now() - ts) + 'ms'
);
console.log(
  'Random example:',
  houseHold[(Math.random() * houseHold.length) | 0]
);
Arnauld
  • Interesting, I think I can work with something like that. The only thing is that I really wanted my `householdArray` to contain `Household` objects with different properties. Two of those properties would be `man` and `woman`, and their values would be the appropriate `Person` objects from my `personArray`. The `Person` object would also have a `household` property whose value would be the appropriate `Household` object from the `householdArray`. This way, I could do things like `personArray[x].household.woman` or `household[x].man.age`. It's all very neat and organized in my head. – snowfrogdev Aug 13 '16 at 16:57
  • @neoflash - Sure. I slightly simplified your original code so that it could be executed as a snippet without the definition of your `Household` object. Just use `householdArray.push(new Household(m, w))` and you should be all set to go. – Arnauld Aug 13 '16 at 17:26

As others and I have stated in the comments, your approach is brute force, which in this case is quadratic in the input size. There are several optimisation possibilities. For binary values (i.e. booleans), splitting the array into categories is trivial. Numeric values like age may be clustered, e.g. into ranges. And you should definitely adopt early abort, as mentioned by mm759. TL;DR: there's a table and a conclusion at the bottom.

Consider the bruteforce approach (for reference):

// The result is a list of matches [[candidate.id, match.id (or -1)]]
function bruteforce(arr) {
  var matches = [];
  for(var i = 0; i < arr.length; ++i) {
    var candidate = arr[i], num = 12;
    var isCandidate = !candidate.isSingle && candidate.isAvailable && candidate.sex == 1;
    var cSex = candidate.sex;
    var cSexPref = candidate.sexPref;
    var cAgeMin = candidate.age - num;
    var cAgeMax = candidate.age + num;
    var result = !isCandidate ? -1 : arr.reduce(function(p,c,k,a){
      return k != i &&
        c.sex == cSexPref &&
        c.sexPref == cSex &&
        !c.isSingle && c.isAvailable &&
        c.age <= cAgeMax &&
        c.age >= cAgeMin ? k : p;
    }, -1);
    if(isCandidate)
      matches.push([i, result]);
  }
  return matches;
}

The category approach might look like this:

function useTheCategory(arr) {
  // preprocessing the data
  var wmNonsingleAvailables = [];
  var wwNonsingleAvailables = [];
  var mwNonsingleAvailables = [];
  var mmNonsingleAvailables = [];
  // split the data into categories
  arr.forEach(function(c) {
    if(!c.isSingle && c.isAvailable) {
      if(c.sex == 0) {
        if(c.sexPref == 1)
          wmNonsingleAvailables.push(c);
        else
          wwNonsingleAvailables.push(c);
      } else {
        if(c.sexPref == 0)
          mwNonsingleAvailables.push(c);
        else
          mmNonsingleAvailables.push(c);
      }
    }
  });

  var matches = [];
  for(var i = 0; i < arr.length; ++i) {
    var candidate = arr[i], num = 12;
    var isCandidate = !candidate.isSingle && candidate.isAvailable && candidate.sex == 1;
    var cSex = candidate.sex;
    var cSexPref = candidate.sexPref;
    var cAgeMin = candidate.age - num;
    var cAgeMax = candidate.age + num;
    if(isCandidate) {
      var category = null;
      // find the relevant category (in this case)
      // a more complex approach/split might include multiple categories here
      if(cSex == 0) {
        if(cSexPref == 1)
          category = mwNonsingleAvailables;
        else if(cSexPref == 0)
          category = wwNonsingleAvailables;
      } else if(cSex == 1) {
        if(cSexPref == 0)
          category = wmNonsingleAvailables;
        else if(cSexPref == 1)
          category = mmNonsingleAvailables;
      }
      var result = -1;
      if(category == null) {
        // always handle the error case...
        console.log("logic error: missing category!");
        console.log("candidate: " + JSON.stringify(candidate));
      } else {
        // the tests for matching sex/single/availability are left-overs and not necessarily required,
        // they are left in here to show that the reduce is not the culprit of your slowdown
        var match = category.reduce(function(p,c,k,a){
          return c.id != i &&
            c.sex == cSexPref &&
            c.sexPref == cSex &&
            !c.isSingle && c.isAvailable &&
            c.age <= cAgeMax &&
            c.age >= cAgeMin ? k : p;
        }, -1);
        // translate to arr index
        if(match != -1)
          result = category[match].id;
      }
      matches.push([i, result]);
    }
  }
  return matches;
}

And the age range bucket approach might look like this:

function useAgeRange(arr) {
  // preprocessing the data
  var ranges = [1, 2, 3, 4, 5]; // find appropriate ranges to spread the entries evenly (analyse your data, more below...)
  var ageBuckets = [];
  // find the range of age values
  var ageRange = arr.length == 0 ? [0, 0] : arr.reduce(function(p,c) {
    var min = c.age < p[0] ? c.age : p[0];
    var max = c.age > p[1] ? c.age : p[1];
    return [min, max];
  }, [arr[0].age, arr[0].age]);
  // build the buckets (floor for nicer values)
  for(var age = Math.floor(ageRange[0]), maxAge = ageRange[1], step = 0; age <= maxAge; age += step) {
    // update step size
    if(step == 0)
      step = ranges[0];
    else
      step = ranges[Math.min(ranges.length - 1, ranges.indexOf(step) + 1)];
    ageBuckets.push({
      nextAge: age + step,
      bucket: [],
    });
  }
  function findBucketIndex(age) {
    // first i with age < ageBuckets[i + 1].nextAge; placement and lookup both
    // use this function, so the consistent off-by-one shift is harmless
    for(var i = 0, maxi = ageBuckets.length - 1; i < maxi; ++i)
      if(age < ageBuckets[i + 1].nextAge)
        return i;
    return -1;
  }
  arr.forEach(function(c) {
    ageBuckets[findBucketIndex(c.age)].bucket.push(c);
  });

  var matches = [];
  for(var i = 0; i < arr.length; ++i) {
    var candidate = arr[i], num = 12;
    var isCandidate = !candidate.isSingle && candidate.isAvailable && candidate.sex == 1;
    var cSex = candidate.sex;
    var cSexPref = candidate.sexPref;
    var cAgeMin = candidate.age - num;
    var cAgeMax = candidate.age + num;
    if(isCandidate) {
      // Find range intersection with ageBuckets
      var startBucket = findBucketIndex(cAgeMin);
      var endBucket = findBucketIndex(cAgeMax);
      if(startBucket < 0) startBucket = 0;
      if(endBucket < 0) endBucket = ageBuckets.length - 1;
      var result = -1;
      // now only search those candidate buckets
      for(var b = startBucket; b <= endBucket; ++b) {
        var bucket = ageBuckets[b].bucket;
        var match = bucket.reduce(function(p,c,k,a){
          return c.id != i &&
            c.sex == cSexPref &&
            c.sexPref == cSex &&
            !c.isSingle && c.isAvailable &&
            c.age <= cAgeMax &&
            c.age >= cAgeMin ? k : p;
        }, -1);
        // translate to arr index
        if(match >= 0)
          result = bucket[match].id;
      }
      matches.push([i, result]);
    }
  }
  return matches;
}

I created a benchmark on jsfiddle showing the improvements of the two approaches. Both are effective by themselves, even including the preprocessing (values will differ across systems and browsers):

N        Search space   Brute force   Categories   Range buckets
         (#matches)     (relative timing values)
 20000    2500              200           34           140
 40000    5000             1400          180           556
 80000   10000             5335          659          2582
160000   20000            17000         2450         16900

Analysing your data to find which approach is appropriate is everything: my benchmark generates an exponential distribution (ages 18-20 make up 28% of the data points, 21-32 another 27%, 33-52 another 27%, and 53-77 the remaining 18%). The range approach doesn't deal well with this distribution, as the timings above show (for a fixed num = 12 years and 14 buckets), because for most queries an age range of 24 years spans 55% of the data.
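For example, a quick way to inspect the distribution before choosing bucket boundaries (a sketch; the quartile cut-offs are arbitrary):

// Sort a copy of the ages and read off quantiles, so buckets can be chosen
// to hold roughly equal numbers of people instead of equal age spans.
var ages = arr.map(function (p) { return p.age; }).sort(function (a, b) { return a - b; });
var quartiles = [0.25, 0.5, 0.75, 1].map(function (q) {
    return ages[Math.min(ages.length - 1, Math.floor(q * ages.length))];
});
console.log('Age quartile boundaries: ' + quartiles.join(', '));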

BeyelerStudios