4

I have a set of samples that is sorted, but due to errors in the data sometimes unsorted values creep in. I need to detect these values and remove them. I'll show some sample datasets below.

20 30 21 22 23 24 25

30 31 21 22 23 24 25

30 21 22 23 24 25 26

20 21 22 23 18 25 26

20 15 21 22 23 24 25

In each case the bold italic numbers are the ones that should be removed. What would be an algorithm to do remove these numbers/detect the indices of these numbers?

Community
  • 1
  • 1
diAblo
  • 337
  • 4
  • 14
  • My question is what would be an algorithm to achieve the same. I can explain some approaches I tried if that helps – diAblo Jan 15 '16 at 11:19
  • 2
    Read the array from low index to high index (or vice versa, it hardly matters) and throw out (or otherwise mark) the out-of-sequence values. You're not going to find a general approach with complexity lower than `O(n)`. What else might you be thinking of ? – High Performance Mark Jan 15 '16 at 11:23
  • @HighPerformanceMark That fails for 2nd test case. You will need to do it repeatidly with this approach, and that's not O(n). – amit Jan 15 '16 at 11:26
  • O(n) is what I am looking for. I want to know how to mark the out-of-sequence elements? Can't use arr[k + 1] < arr[k] as far as I can see. – diAblo Jan 15 '16 at 11:28
  • 1
    You should be looking for the [longest increasing subsequence](https://en.wikipedia.org/wiki/Longest_increasing_subsequence). See also: [Remove unsorted/outlier elements in nearly-sorted array](http://stackoverflow.com/a/32217310/3789665) – greybeard Jan 15 '16 at 11:30
  • @greybeard Thanks. That should solve my problem. Out of curiosity, if the second case was not there, i.e out of sequence elements were limited to a single digit, what would be the steps? Might be my sleep ridden brain, but I can't seem to figure out a way for that either. – diAblo Jan 15 '16 at 11:40

3 Answers3

3

Detecting is relatively simpler and takes less steps - you could do it in O(n) time. Simply iterate over the array and compare each element to the next. You would be able to find (and mark indices or throw-out) out-of-sequence numbers.

However, your second case makes doing this a problem. I will assume that you always want to keep the longest increasing sub-sequence of the list of numbers (like in the 2nd case).

You can solve this problem efficiently with arrays and binary search. The algorithm performs a single binary search per sequence element, its total time can be expressed as O(n log n).

Process the sequence elements in order, maintain the longest increasing sub-sequence found so far. Denote the sequence values as X[0], X[1], etc. L representing the length of the longest increasing sub-sequence found so far.

M[j] stores the index k of the smallest value X[k] such that there is an increasing sub-sequence of length j ending at X[k] on the range k ≤ i. j ≤ k ≤ i always. P[k] stores the index of the predecessor of X[k] in the longest increasing sub-sequence ending at X[k]

The sequence X[M[1]], X[M[2]], ..., X[M[L]] is always nondecreasing at all points of the algorithm.

P = array of length N
M = array of length N + 1 // Using a 1 indexed array for ease of understanding

L = 0
for i in range 0 to N-1:
   // Binary search
   lo = 1
   hi = L
   while lo ≤ hi:
       mid = ceil((lo+hi)/2)
       if X[M[mid]] < X[i]:
           lo = mid+1
       else:
           hi = mid-1

newL = lo

P[i] = M[newL-1]
M[newL] = i

if newL > L:
    L = newL

S = array of length L
k = M[L]
for i in range L-1 to 0:
    S[i] = X[k]
    k = P[k]

return S

The pseudo-code can be found on the Wikipedia article for this.

If you do want to keep the out-of-sequence elements in the list, simply sort the array using insertion sort.

Madhav Datt
  • 1,065
  • 2
  • 10
  • 25
  • Please give pseudo-code for detection in case of a single unsorted number. I can't iterate over and do "if(arr[k + 1] < arr[k]) throw arr[k + 1]" since that would fail in my 1st case, and I can't do "if(arr[k+1] < arr[k]) throw arr[k]" since that would fail in my 4th case. – diAblo Jan 15 '16 at 12:22
  • 1
    `if(arr[k-1] < arr[k] && arr[k] > arr[k+1] || arr[k-1] > arr[k] && arr[k] < arr[k+1]) throw arr[k]` Handle corner indices separately. – Madhav Datt Jan 15 '16 at 12:46
1

Detecting only

It takes at least N-1 steps to check (check each element and next), to do it.

But it is ambiguous: In list 2, what is wrong ? 30/31, or 21/.../25 ?

If bad numbers are isolated, you just remove them. But if you have, say, 2 numbers, what to do ? You must define more rules.

Detecting, and sorting :

Complexity:

If your list is perfectly sorted, it takes N-1 steps (check each element and next), to do it.

If there is one unsorted element, it takes log N to replace it at good place (if I suppose everything else is sorted, and in a structure ad hoc like binary tree).

It there are k unsorted elements, it will take k log N .

So N (check) + k log N (insert).

And if everything is messed, N log N, which is classical complexity for sorting.

Algorithm:

So, simpliest algorithm is to iterate, and insert at good place, in a balanced tree. It is a sort by insertion.

It is like smoothsort: https://en.wikipedia.org/wiki/Smoothsort

-1

I think this should work for you. It finds the longest subsequence and then clears up the other elements. The implementation is in c#

public static void Main() {
    int[][] dataList = {
                        new []{20,30,21,22,23,24,25},
                        new []{30,31,21,22,23,24,25},
                        new []{30,21,22,23,24,25,26},
                        new []{20,21,22,23,18,25,26},
                        new []{20,15,21,22,23,24,25}
                    };

    foreach (var data in dataList)
        DetectAndRemoveUnsorted(data);
}

/// <summary>
/// Assumes ascending data. You can adapt it for descending data too
/// </summary>
static void DetectAndRemoveUnsorted(IList<int> data) {
    // first pass: Find the outliers; rather find the correct sequence
    int startOfLongestSeq = 0, lenOfLongestSeq = 0;
    int startOfCurrSeq = 0, lenOfCurrSeq = 0;
    for (int i = 0; i < data.Count - 1; ++i) {
        if (data[i] > data[i + 1]) { // we are breaking the ascending order, so this is another sequence
            lenOfCurrSeq = i - startOfCurrSeq + 1;
            if (lenOfCurrSeq > lenOfLongestSeq) {
                lenOfLongestSeq = lenOfCurrSeq;
                startOfLongestSeq = startOfCurrSeq;
            }
            startOfCurrSeq = i + 1;
        }
    }

    lenOfCurrSeq = data.Count - startOfCurrSeq;
    if (lenOfCurrSeq > lenOfLongestSeq) {
        lenOfLongestSeq = lenOfCurrSeq;
        startOfLongestSeq = startOfCurrSeq;
    }


    // second pass: cleanup outliers

    // now we know which sequence is the largest
    // we should get rid of the other sequences
    for (int i = startOfLongestSeq - 1; i >= 0; --i)
        data[i] = -1; // Mark them as invalid. if you want, you can delete them as well

    for (int i = data.Count - 1; i >= startOfLongestSeq + lenOfLongestSeq; --i)
        data[i] = -1; // Mark them as invalid. if you want, you can delete them as well
}
Vikhram
  • 4,294
  • 1
  • 20
  • 32