Efficient algorithm for detecting different elements in a collection

Question

Imagine you have a set of five elements (A-E) with some numeric values of a measured property (several observations for each element, for example "heart rate"):

A = {100, 110, 120, 130}
B = {110, 100, 110, 120, 90}
C = { 90, 110, 120, 100}
D = {120, 100, 120, 110, 110, 120}
E = {110, 120, 120, 110, 120}

First, I have to detect whether there are significant differences in the average levels. So I run a one-way ANOVA using the statistics package provided by Apache Commons Math. No problems so far: I obtain a boolean that tells me whether differences are found or not.

Second, if differences are found, I need to know which element (or elements) is different from the rest. I plan to use unpaired t-tests, comparing each pair of elements (A with B, A with C .... D with E), to know whether one element is different from the other. So, at this point I have a list of the pairs of elements that differ significantly, for example:

C is different than B
C is different than D

But I need a generic algorithm to efficiently determine, with that information, what element is different than the others (C in the example, but could be more than one).

Leaving statistical issues aside, the question could be (in general terms): "Given the information about equality/inequality of each one of the pairs of elements in a collection, how could you determine the element/s that is/are different from the others?"

This seems to be a problem where graph theory could be applied. I am using Java for the implementation, if that is useful.

Edit: Elements are people and measured values are times needed to complete a task. I need to detect who is taking too much or too little time to complete the task, as part of a fraud detection system.

Very well formatted question. Depends what you mean by different element. Do you mean the element with the most difference edges? In the graph example you've presented so far it seems you would simply be looking for the element with the highest degree? — Pace, Feb 24 '10 at 13:53
Could you elaborate on your definition of "different" or "significant differences"? A naive approach would say all are different. But obviously, that's not what you're after. — sfussenegger, Feb 24 '10 at 13:56
@sfussenegger Thanks. By "different elements" I mean elements whose mean for the measured property is different in statistical terms. That is, a statistically significant difference is found at a given confidence level (typically 95%). http://en.wikipedia.org/wiki/Statistical_significance — Guido, Feb 24 '10 at 14:07
@Pace Thank you. I'll think about using the element with the highest degree. The example above was a simplified case. In a real scenario I can have hundreds of pairs "X is different than Y". Do you know any good Java library to efficiently work with graph structures ? — Guido, Feb 24 '10 at 14:09
My point was more that if all you're looking for is the highest degree then there is no need to create a graph at all. Simply iterate through your C-B difference and for each difference cast one vote for each element (one for C and one for B). At the end you can sort your votes and pick the element with the most. If you have a more complicated measure then you might want a graph. — Pace, Feb 24 '10 at 14:11
Two points: 1. If you want to know whether an element is different from *all* others, just count how many it is different from - it should be n-1 to be different from *all*. Otherwise, per Pace, you need to define what you mean. And how do you want to handle situations like A=B, B=C, but A!=C, which are bound to come up? 2. Unadjusted t-tests are a bad way to do pairwise comparisons after ANOVA. — Aniko, Feb 24 '10 at 14:22
With five elements, it can happen that A, C, E are different than B, D (an element is not always different from all others). That is, A!=B A!=D B!=C B!=E C!=D D!=E. In this case, Pace's idea (use the element with the highest degree) works. I've just edited the question to reflect the real use case. @Aniko could you please elaborate on your point 2 about unadjusted t-tests being a bad way to do pairwise comparisons? Thanks. — Guido, Feb 24 '10 at 14:47
At the very least you would want to use Fisher's LSD procedure which uses a pooled SD estimate, and thus has more degrees of freedom -> more power. But this method does not control the overall type I error rate if most means are equal, and only a few are different (i.e. exactly your situation). I would suggest Tukey's HSD. — Aniko, Feb 24 '10 at 16:26

score 4 · Accepted Answer · answered Feb 24 '10 at 23:01

Just in case anyone is interested in the final code, here it is, using Apache Commons Math to perform the statistical operations and Trove to work with collections of primitive types.

It looks for the element(s) with the highest degree (the idea is based on comments made by @Pace and @Aniko, thanks).

I think the final algorithm is O(n^2) in the number of elements, since every pair is tested; suggestions are welcome. It should work for any problem involving one qualitative vs. one quantitative variable, assuming normality of the observations.

import gnu.trove.iterator.TIntIntIterator;
import gnu.trove.map.TIntIntMap;
import gnu.trove.map.hash.TIntIntHashMap;
import gnu.trove.procedure.TIntIntProcedure;
import gnu.trove.set.TIntSet;
import gnu.trove.set.hash.TIntHashSet;

import java.util.ArrayList;
import java.util.List;

import org.apache.commons.math.MathException;
import org.apache.commons.math.stat.inference.OneWayAnova;
import org.apache.commons.math.stat.inference.OneWayAnovaImpl;
import org.apache.commons.math.stat.inference.TestUtils;


public class TestMath {
    private static final double SIGNIFICANCE_LEVEL = 0.001; // 99.9%

    public static void main(String[] args) throws MathException {
        double[][] observations = {
           {150.0, 200.0, 180.0, 230.0, 220.0, 250.0, 230.0, 300.0, 190.0 },
           {200.0, 240.0, 220.0, 250.0, 210.0, 190.0, 240.0, 250.0, 190.0 },
           {100.0, 130.0, 150.0, 180.0, 140.0, 200.0, 110.0, 120.0, 150.0 },
           {200.0, 230.0, 150.0, 230.0, 240.0, 200.0, 210.0, 220.0, 210.0 },
           {200.0, 230.0, 150.0, 180.0, 140.0, 200.0, 110.0, 120.0, 150.0 }
        };

        final List<double[]> classes = new ArrayList<double[]>();
        for (int i=0; i<observations.length; i++) {
            classes.add(observations[i]);
        }

        OneWayAnova anova = new OneWayAnovaImpl();
//      double fStatistic = anova.anovaFValue(classes); // F-value
//      double pValue = anova.anovaPValue(classes);     // P-value

        boolean rejectNullHypothesis = anova.anovaTest(classes, SIGNIFICANCE_LEVEL);
        System.out.println("reject null hypothesis " + (100 - SIGNIFICANCE_LEVEL * 100) + "% = " + rejectNullHypothesis);

        // differences are found, so make t-tests
        if (rejectNullHypothesis) {
            // fraud maps element index -> number of pairs in which it differed (its degree)
            TIntIntMap fraud = new TIntIntHashMap();

            // i vs j unpaired t-tests - O(n^2)
            for (int i=0; i<observations.length; i++) {
                for (int j=i+1; j<observations.length; j++) {
                    boolean different = TestUtils.tTest(observations[i], observations[j], SIGNIFICANCE_LEVEL);
                    if (different) {
                        // count one "difference" edge for each endpoint,
                        // starting at 1 the first time an element is seen
                        fraud.adjustOrPutValue(i, 1, 1);
                        fraud.adjustOrPutValue(j, 1, 1);
                    }
                }
            }

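            // A hash map is not ordered, so scan all values for the highest degree
            int highest = 0;
            for (int degree : fraud.values()) {
                highest = Math.max(highest, degree);
            }
            final int max = highest;
            // Keep only those with the highest degree
            // (retainEntries keeps the entries for which execute returns true)
            fraud.retainEntries(new TIntIntProcedure() {
                @Override
                public boolean execute(int a, int b) {
                    return b == max;
                }
            });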

            // If more than half of the elements are different
            // then they are not really different (?)
            if (fraud.size() > observations.length / 2) {
                fraud.clear();
            }

            // output
            TIntIntIterator it = fraud.iterator();
            while (it.hasNext()) {
                it.advance();
                System.out.println("Element " + it.key() + " has significant differences");             
            }
        }
    }
}

score 0 · Answer 2 · answered Feb 24 '10 at 14:57

Your edit gives good details; thanks.

Based on that I would presume a fairly well-behaved distribution of times (normal, or possibly gamma; depends on how close to zero your times get) for typical responses. Rejecting a sample from this distribution could be as simple as computing a standard deviation and seeing which samples lie more than n stdevs from the mean, or as complex as taking subsets which exclude outliers until your data settles down into a nice heap (e.g. the mean stops moving around 'much').

Now, you have an added wrinkle if you assume that a person who monkeys with one trial will monkey with another. So you're really trying to discriminate between a person who just happens to be fast (or slow) vs. one who is 'cheating'. You could do something like compute the stdev rank of each score (I forget the proper name for this: if a value is two stdevs above the mean, the score is '2'), and use that as your statistic.

Then, given this new statistic, there are some hypotheses you'll need to test. E.g., my suspicion is that the stdev of this statistic will be higher for cheaters than for someone who is just uniformly faster than other people--but you'd need data to verify that.
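The "stdev rank" described above is commonly called a z-score (standard score). Here is a minimal sketch in plain Java, assuming the mean and the sample standard deviation are computed over whatever pool of times you pass in; the class and method names are made up for illustration.

```java
final class ZScores {
    /** Arithmetic mean of the values. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (n - 1 in the denominator). */
    static double stdev(double[] xs) {
        double m = mean(xs);
        double squares = 0.0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.length - 1));
    }

    /** z-score of each value: how many stdevs it lies above or below the mean. */
    static double[] zScores(double[] xs) {
        double m = mean(xs);
        double s = stdev(xs);
        double[] z = new double[xs.length];
        for (int i = 0; i < xs.length; i++) {
            z[i] = (xs[i] - m) / s;
        }
        return z;
    }
}
```

One way to use it would be to pool everyone's times, compute the z-scores, and then look at the mean and spread of each person's z-scores. Whether that separates cheaters from people who are simply fast is exactly the hypothesis that needs real data.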

Good luck with it!

Thank you. In fact, I think that is what ANOVA (ANalysis Of VAriance) does under the hood. — Guido, Feb 24 '10 at 15:36
Right, that thing. Been a while since stats class. So what is your question, then? Where a good ANOVA implementation can be found? — Alex Feinman, Feb 24 '10 at 20:08
Not really. The real problem is that ANOVA says there are differences, and I can even know if an element X is different than other element Y, but I don't know which one is different. — Guido, Feb 24 '10 at 22:09
Your distribution is well-behaved. So you can assume the outliers lie at the max or the min. Start pulling the outliers from the dataset, one by one, and recalculate the mean, until it stops moving so much, or until the change in stdev gets small. — Alex Feinman, Feb 25 '10 at 14:53
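The "pull outliers one by one" idea from the last comment could be sketched roughly as follows. This is only one possible reading of it: the stopping rule (stop when removing the next candidate would move the mean by less than a tolerance) and the names `OutlierTrimmer`, `trim` and `tolerance` are assumptions, not something specified in the thread.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlierTrimmer {
    /** Arithmetic mean of the values. */
    static double mean(List<Double> xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.size();
    }

    /**
     * Repeatedly pulls the value farthest from the current mean, stopping as
     * soon as pulling the next candidate would move the mean by less than
     * tolerance. Returns the pulled values; the input list is not modified.
     */
    static List<Double> trim(List<Double> values, double tolerance) {
        List<Double> remaining = new ArrayList<>(values);
        List<Double> removed = new ArrayList<>();
        while (remaining.size() > 2) {
            double m = mean(remaining);
            // Find the index of the value farthest from the mean
            int farthest = 0;
            for (int i = 1; i < remaining.size(); i++) {
                if (Math.abs(remaining.get(i) - m) > Math.abs(remaining.get(farthest) - m)) {
                    farthest = i;
                }
            }
            // Tentatively remove it and see how much the mean moves
            List<Double> candidate = new ArrayList<>(remaining);
            double pulled = candidate.remove(farthest);
            if (Math.abs(mean(candidate) - m) < tolerance) {
                break; // the mean has settled, keep this value
            }
            removed.add(pulled);
            remaining = candidate;
        }
        return removed;
    }
}
```

In the fraud scenario the input might be one summary value per person (for example each person's mean time), so that the returned values point at people rather than at individual trials.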

score 0 · Answer 3 · answered Feb 24 '10 at 20:51

You would have to run the paired t-test (or whatever pairwise test you want to implement) and then increment the counts in a hash where the key is the Person and the value is the number of times that person was different.

I guess you could also have an ArrayList that contains Person objects. Each Person object could store its ID and the count of times it was different. Implement Comparable and then you could sort the ArrayList by count.
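A minimal sketch of that idea in plain Java, assuming the pairwise tests have already been run and their significant results are available as pairs of IDs. The `Person` class, the `rank` method and the `String[][]` input format are hypothetical choices made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DifferenceCounter {
    /** A person and the number of pairwise tests in which they differed. */
    static final class Person implements Comparable<Person> {
        final String id;
        int differences;

        Person(String id) {
            this.id = id;
        }

        @Override
        public int compareTo(Person other) {
            // Descending by count, so the "most different" person sorts first
            return Integer.compare(other.differences, this.differences);
        }
    }

    /** Each pair {x, y} means "x tested as significantly different from y". */
    static List<Person> rank(String[][] differentPairs) {
        Map<String, Person> byId = new HashMap<>();
        for (String[] pair : differentPairs) {
            for (String id : pair) {
                // Create the person on first sight, then count this difference
                byId.computeIfAbsent(id, Person::new).differences++;
            }
        }
        List<Person> ranked = new ArrayList<>(byId.values());
        Collections.sort(ranked);
        return ranked;
    }
}
```

With the pairs from the question ("C is different than B", "C is different than D"), C ends up first with a count of 2, while B and D each have a count of 1. Ties are left in arbitrary order.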

score 0 · Answer 4 · answered Feb 24 '10 at 21:19

If the items in the lists are sorted in numerical order, you can walk the two lists simultaneously, and any differences can easily be recognized as insertions or deletions. For example:

List A    List B
  1         1       // Match, increment both pointers
  3         3       // Match, increment both pointers
  5         4       // '4' missing in list A. Increment B pointer only.

List A    List B
  1         1       // Match, increment both pointers
  3         3       // Match, increment both pointers
  4         5       // '4' missing in list B (or added to A). Incr. A pointer only.
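The walk shown in the two tables can be sketched in Java as below, assuming both inputs are sorted ascending. The class name, method name and the wording of the messages are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

final class SortedDiff {
    /**
     * Walks two ascending-sorted arrays in lockstep and reports every value
     * that appears in only one of them, naming the list it is missing from.
     */
    static List<String> diff(int[] a, int[] b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                // Match: advance both pointers
                i++;
                j++;
            } else if (a[i] < b[j]) {
                // a[i] cannot appear later in b: advance the A pointer only
                out.add(a[i] + " missing in B");
                i++;
            } else {
                // b[j] cannot appear later in a: advance the B pointer only
                out.add(b[j] + " missing in A");
                j++;
            }
        }
        // Whatever is left over in one list is missing from the other
        while (i < a.length) {
            out.add(a[i++] + " missing in B");
        }
        while (j < b.length) {
            out.add(b[j++] + " missing in A");
        }
        return out;
    }
}
```

Each element is visited once, so the walk is linear in the combined length of the two lists.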
