
I am trying to come up with a method to regression test number sequences.

My system under test produces a large set of numbers for each system version (e.g. height, width, depth, etc.). These numbers vary from version to version in an unknown fashion. Given a sequence of "good" versions and one "new" version, I'd like to find the sequences which are most abnormal.

Example:

"Good" version:

version    width   height   depth
   1        123      43      302 
   2        122      44      304
   3        120      46      300
   4        124      45      301

"New" version:

   5        121      60      305

In this case I obviously would like to find the height sequence because the value 60 stands out more than the width or the depth.

My current approach computes the mean and the standard deviation of each sequence over the good versions; for a new version's number it then computes the probability that this number belongs to that sequence (based on the known mean and standard deviation). This works … kind of.
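Expressed as a per-sequence z-score, that baseline fits in a few lines; this is a minimal sketch (the function and variable names are mine, not from any library), using the example data above:

```python
import math

def z_scores(good, new):
    """For each sequence, measure how many sample standard deviations
    the new value lies from the mean of the 'good' values."""
    scores = {}
    for col, values in good.items():
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
        if std > 0:
            scores[col] = abs(new[col] - mean) / std
        else:
            # Constant sequence: any deviation at all is maximally abnormal.
            scores[col] = 0.0 if new[col] == mean else float("inf")
    return scores

good = {"width":  [123, 122, 120, 124],
        "height": [43, 44, 46, 45],
        "depth":  [302, 304, 300, 301]}
new = {"width": 121, "height": 60, "depth": 305}
scores = z_scores(good, new)
# 'height' gets by far the largest score (about 12 sigma)
```

Sorting `scores` in descending order ranks the sequences by abnormality, which gives the "find the most abnormal sequences" behaviour directly.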

The numbers in my sequences are not necessarily Gaussian-distributed around a mean value; often they are rather constant and only sometimes produce an outlier value which itself also seems to be rather constant, e.g. 10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10. In this case, based only on mean and standard deviation, the value 10 would not be 100% likely to be in the sequence, and the value 5 would be rated as rather unlikely.

I considered using a histogram approach but hesitated and decided to ask here first. The problem with a histogram would be that I would need to store quite a lot of information for each sequence (in contrast to just a mean and a standard deviation).
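For near-constant sequences like the one above, though, a histogram compresses very well: storing counts of distinct values (e.g. with `collections.Counter`) costs one entry per distinct value, not one per possible bin. A sketch, under the assumption that exact-value matching is acceptable:

```python
from collections import Counter

def build_model(values):
    # Store only the empirical frequency of each distinct value.
    total = len(values)
    return {v: c / total for v, c in Counter(values).items()}

def likelihood(model, value):
    # How often the good versions produced exactly this value.
    return model.get(value, 0.0)

model = build_model([10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10])
likelihood(model, 10)  # ~0.846
likelihood(model, 5)   # ~0.154
likelihood(model, 60)  # 0.0 -> candidate abnormality
```

For the 13-value example sequence the model is just two entries, so the storage overhead versus (mean, stddev) is small whenever few distinct values occur.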

The next aspect I thought about: I am pretty sure that this kind of task is not new and that there probably are existing solutions which would fit my situation nicely, but my research turned up little.

I found a library like PyBrain which at first glance seems to process number sequences and then apparently tries to analyse them with a simulated neural network. I'm not sure whether this would be an approach for me (and again it seems like I would have to store a large amount of data for each number sequence, like a complete neural network).

So my question is this:

Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)? Preferably while storing only small amounts of data per sequence ;-)

For concrete implementations I'd prefer Python, but hints on other languages would be welcome as well.

Alfe
  • Yeah, I know, I ask for a library ;-) But please fight the reflex to quickly mark this as off-topic because it is a question about a library. If you read the whole question carefully you'll find that I describe a lot of what I tried and considered, and the problems of these approaches. The question I have is not "which library is best for this" (which the off-topic rule aims to kick out) but "is there any technology tackling this issue". I don't think that's off-topic here. – Alfe Feb 17 '17 at 16:18
  • I updated my answer, have a look at it if it fulfills your requirements now – user7291698 Feb 24 '17 at 12:55
  • Having thought about it, I think you should try to minimise tests like these in a test harness. What you should try to do instead is refactor out the source of your uncertainty (a dependency on a random number generator) and then, using dependency injection, inject a deterministic mock at test time so that you don't experience test uncertainty. – rnoodle Feb 28 '17 at 08:51
  • Of course, if I could change the SUT I could find more intelligent ways of testing. But the SUT is part of another department. I'm just supposed to find the most peculiar spots in the SUT's output. – Alfe Feb 28 '17 at 16:04

4 Answers


Is there a technique, an algorithm, or a science discipline out there which would help me analyse number sequences to find abnormalities (in a last value)?

The scientific discipline you are looking for is called outlier detection / anomaly detection. There are a lot of techniques and algorithms you can use. As a starting point, have a look at the Wikipedia articles on outlier detection and anomaly detection. There is also a similar question on stats.stackexchange.com and one on datascience.stackexchange.com that is focused on Python.

You should also think about which is worse in your case, false positives (type I errors) or false negatives (type II errors), since decreasing the rate of one of these error types increases the rate of the other.

EDIT: given your requirement with multiple peaks in some cases, flat distributions in other cases, an algorithm like this could work:

1.) count the number of occurrences of each single number in your sequence, and place the count in a bin that corresponds to that number (initial bin width = 1)

2.) iterate through the bins: if a single bin contains more than e.g. 10% (parameter a) of the total number of values in your sequence, mark the numbers of that bin as "good values"

3.) increase the bin width by 1 and repeat step 1 and 2

4.) repeat step 1-3 until e.g. 90% (parameter b) of the numbers in your sequence are marked as "good values"

5.) let the test cases for the bad values fail

This algorithm should work for cases such as:

  • a single large peak with some outliers

  • multiple large peaks and some outliers in between

  • a flat distribution with a concentration in a certain region (or in multiple regions)

  • a number sequence where all numbers are equal

Parameters a and b have to be adjusted to your needs, but I think that shouldn't be hard.

Note: to check which bin a value belongs to, you can use the modulo operator (%). E.g. if the bin size is 3 and you have the values 475, 476, 477, 478, 479, name each bin after the value whose modulo with the bin size is zero: 477 % 3 == 0, so 477, 478, and 479 go into bin 477.
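Putting steps 1–5 together, a sketch could look like this (I use floor division to assign values to bins, which groups values the same way as the modulo-based bin naming in the note; parameters a and b are the ones defined in steps 2 and 4):

```python
from collections import Counter

def good_values(seq, a=0.10, b=0.90):
    """Widen the bins until at least fraction b of the values sit in
    'popular' bins, i.e. bins holding more than fraction a of all values."""
    n = len(seq)
    good = set()
    width = 1
    max_width = max(seq) - min(seq) + 1
    while width <= max_width:
        bins = Counter(v // width for v in seq)   # step 1: bin the values
        for idx, count in bins.items():
            if count > a * n:                     # step 2: popular bin?
                good.update(v for v in seq if v // width == idx)
        if sum(1 for v in seq if v in good) >= b * n:
            break                                 # step 4: enough coverage
        width += 1                                # step 3: widen the bins
    return good                                   # step 5: the rest fail

seq = [10, 10, 10, 10, 10, 5, 10, 10, 10, 5, 10, 10, 10]
good_values(seq)  # -> {5, 10}: both plateaus are accepted
```

On the question's bimodal example both 10 and 5 are accepted at bin width 1, while a rare stray value (say a single 60) never reaches the 10% threshold and would fail its test case.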

user7291698

You could use a regression technique called a Gaussian process (GP) to learn the curve and then apply the Gaussian process to the next example in your sequence.

Since a GP gives you not only an estimate for the target but also a confidence, you could threshold on the confidence to determine what counts as an outlier.

Various toolboxes implement GPs (scikit-learn, Shogun, …), but the easiest to get going is probably GPy. An example of 1-D regression that you can adapt to your task is nicely described in the following notebook:

http://nbviewer.jupyter.org/github/SheffieldML/notebook/blob/master/GPy/basic_gp.ipynb
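If installing GPy is a hurdle, the core idea also fits in a few lines of plain NumPy. This sketch implements GP regression with an RBF kernel from scratch, with fixed, hand-picked hyperparameters (GPy would optimise them for you), and scores the new height value from the question:

```python
import numpy as np

def rbf(a, b, length=2.0):
    # Squared-exponential (RBF) kernel between two 1-D arrays.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_train, y_train, x_test, noise=0.5):
    """Posterior mean and variance of a zero-mean GP; the targets are
    centred first so the zero-mean assumption is harmless."""
    y_mean = y_train.mean()
    y = y_train - y_mean
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    mean = K_s.T @ np.linalg.solve(K, y) + y_mean
    # Predictive variance for a new noisy observation.
    var = 1.0 - np.sum(K_s * np.linalg.solve(K, K_s), axis=0) + noise
    return mean, var

versions = np.array([1.0, 2.0, 3.0, 4.0])
heights = np.array([43.0, 44.0, 46.0, 45.0])
mean, var = gp_predict(versions, heights, np.array([5.0]))
z = abs(60.0 - mean[0]) / np.sqrt(var[0])
# z is many standard deviations -> flag 'height' as abnormal
```

Thresholding on `z` gives exactly the confidence-based outlier test described above; with GPy the same quantities come out of `model.predict`.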

  • Installing it was a PITA, but the results I got with this are what I expected :-) Alas, it only predicts values with one single peak in the histogram (spiky or flat, depending on the input data), so an input of values like (5, 2, 5, 2, 5, 5, 2, 2, 2, 5, 2, 5, 5, 2, 2, 5) will gladly accept a 3 or 4 although I would call that a clear novelty / outlier. If someone can offer a similarly simple-to-use library which also learns things like this, you're beaten; otherwise you'll get the bounty :) – Alfe Feb 23 '17 at 22:04

I wonder whether different columns in your data could be treated in different ways. Is it appropriate, for example, to treat the width with a "close to the mean" check, another column with a "value seen in the set of good examples" check, and a third column with an "in an existing cluster from k-means clustering of the good examples" check?

You could score each column, flag any new value for which one or more columns are deemed not to fit, and state why.

Hmm, it's not restricted to individual columns either: if there is some relation between column values, that could be checked too; maybe width times height is limited, or the volume has limits.

Time: it may be that successive values can only deviate in some given manner by some amount. If, for example, the sides were continuously varied by some robot and the time between measurements was short enough, that would limit the delta between successive readings to what the robotic mechanism can produce when it is working correctly.

I guess a large part of this answer is to use any knowledge you have about the data source to help.
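Wiring the per-column idea up as a dispatch table could look like this; the assignment of checks to columns and the 3-sigma threshold are illustrative assumptions, not something derived from the question's data:

```python
import statistics

def near_mean(values, new, k=3.0):
    # Accept values within k sample standard deviations of the mean.
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return abs(new - mean) <= k * std if std > 0 else new == mean

def seen_before(values, new):
    # Accept only values that appeared among the good examples.
    return new in set(values)

# Hypothetical assignment of one check per column.
checks = {"width": near_mean, "height": near_mean, "depth": seen_before}

def flag(good, new):
    # Report every column whose new value fails its assigned check.
    return [col for col, check in checks.items()
            if not check(good[col], new[col])]

good = {"width":  [123, 122, 120, 124],
        "height": [43, 44, 46, 45],
        "depth":  [302, 304, 300, 301]}
new = {"width": 121, "height": 60, "depth": 305}
flag(good, new)  # -> ['height', 'depth']
```

Note how the choice of check changes the verdict: the "close to the mean" check accepts depth 305 easily, while the stricter "seen before" check flags it, which is exactly why encoding knowledge about each column matters.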

Paddy3118
  • I'm looking for a fully automated approach to tackle the cases in which I have an arbitrary and very large number of such sequences. In fact in my use case I have hundreds of such sequences and in my regression tests I get false positives all the time due to slight variations in the values. So any approach involving manually checking sequences is no good for me. I want something which I can feed the values into and which then "figures out" what the values can look like so that for a new value it gives a likelihood of it being a good value or not. – Alfe Feb 19 '17 at 12:08

I am not sure if I understand you correctly, but I think you want to predict whether a sample presented to you (after experiencing a sequence of previous examples) is anomalous or not. You are therefore implying some sort of temporal dependency for the new sample?

If you have lots of training data, i.e. hundreds or thousands of examples of labelled good and bad sequences, then you might be able to train a neural architecture to classify whether the 'next element in the sequence' is anomalous or not. You could train an LSTM (long short-term memory) architecture that would generalise over input sequences to accurately classify the new sample presented to it.

LSTMs will be available in any good neural network library and basically you will be running a general Supervised Learning routine. Tutorials about this are all over the Internet and in any good machine learning (ML) book.

As always in ML, take care of not over-fitting!

rnoodle
  • Training standard 'feedforward' neural architectures is tricky but reasonably well understood now (compared to 10 years ago). Training recurrent architectures (like LSTMs) is even more tricky. From my own experience, you won't get a neural solution immediately. Instead you will need to get your hands dirty in the whole discipline of training these networks. – rnoodle Feb 27 '17 at 21:26