170

Is there a way to measure how sorted a list is?

I mean, it's not about knowing whether a list is sorted or not (a boolean), but something like a ratio of "sortness", similar to the correlation coefficient in statistics.

For example,

  • If the items of a list are in ascending order, then its rate would be 1.0

  • If the list is sorted descending, its rate would be -1.0

  • If the list is almost sorted ascending, its rate would be 0.9 or some value close to 1.

  • If the list is not sorted at all (random), its rate would be close to 0

I'm writing a small library in Scala for practice. I think a sorting rate would be useful, but I can't find any information about anything like that. Maybe I don't know the right terms for the concept.

Josell
  • https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test – wildplasser Jun 08 '13 at 00:17
  • 4
    Would this be used to determine the ideal algorithm to sort the list? E.g. for values close to 0, QuickSort would be ideal, but for values on either end of the scale (nearly sorted or nearly reverse sorted), MergeSort would be much faster, since QuickSort devolves to O(N^2) in those cases. – Darrel Hoffman Jun 08 '13 at 14:04
  • 8
    +1 for "ratio of sortness" – David G Jun 08 '13 at 14:17
  • Do you need this measurement to be made in O(n) time, or is O(n log n) acceptable? – dan04 Jun 08 '13 at 15:29
  • Based on the popular answer http://stackoverflow.com/a/16994740/1168342 would such a function be of much practical use (i.e., you partially have to perform a sort to measure sortedness)? – Fuhrmanator Jun 08 '13 at 15:56
  • 1
    @Fuhrmanator The stochastic version of the algorithm does not have to perform a sort to arrive at a probabilistic estimate of the sortedness. It's only if you want to get an *exact* measure that you need to perform a sort. – Timothy Shields Jun 11 '13 at 20:19
  • 1
    Sarcastic but funny first instinct: You could insertion sort the list and see how long it takes, and then compare that to how long it takes to sort (the now sorted) list and the reverse of it. – kqr Jun 12 '13 at 00:45
  • related: [Minimum number of swaps needed to change Array 1 to Array 2?](http://stackoverflow.com/q/2987605/4279). It also covers O(n log n) algorithm for counting inversions. – jfs Jun 13 '13 at 02:51

10 Answers

147

You can simply count the number of inversions in the list.

Inversion

An inversion in a sequence of elements of type T is a pair of sequence elements that appear out of order according to some ordering < on the set of T's.

From Wikipedia:

Formally, let A(1), A(2), ..., A(n) be a sequence of n distinct numbers.
If i < j and A(i) > A(j), then the pair (i,j) is called an inversion of A.

The inversion number of a sequence is one common measure of its sortedness.
Formally, the inversion number is defined to be the number of inversions, that is,

inv(A) = |{ (i, j) : i < j and A(i) > A(j) }|

To make these definitions clearer, consider the example sequence 9, 5, 7, 6. This sequence has the inversions (0,1), (0,2), (0,3), (2,3) and the inversion number 4.

If you want a value between 0 and 1, you can divide the inversion number by N choose 2, where N is the length of the sequence.

To actually create an algorithm to compute this score for how sorted a list is, you have two approaches:

Approach 1 (Deterministic)

Modify your favorite sorting algorithm to keep track of how many inversions it is correcting as it runs. Though this is nontrivial and has varying implementations depending on the sorting algorithm you choose, you will end up with an algorithm that is no more expensive (in terms of complexity) than the sorting algorithm you started with.

If you take this route, be aware that it's not as simple as counting "swaps." Mergesort, for example, is worst case O(N log N), yet if it is run on a list sorted in descending order, it will correct all N choose 2 inversions. That's O(N^2) inversions corrected in O(N log N) operations. So some operations must inevitably be correcting more than one inversion at a time. You have to be careful with your implementation. Note: you can do this with O(N log N) complexity, it's just tricky.

Related: calculating the number of “inversions” in a permutation
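
For illustration, here is a minimal Scala sketch (Scala being the language mentioned in the question) of the merge-sort variant, assuming a `Vector[Int]` input; the names `sortAndCount` and `inversionRatio` are just placeholders:

object Sortedness {
  // Merge sort that also returns the number of inversions it corrected.
  // Runs in O(N log N): each merge step counts cross-inversions in bulk.
  def sortAndCount(xs: Vector[Int]): (Vector[Int], Long) =
    if (xs.length < 2) (xs, 0L)
    else {
      val (l, r) = xs.splitAt(xs.length / 2)
      val (sl, leftInv) = sortAndCount(l)
      val (sr, rightInv) = sortAndCount(r)
      val merged = Vector.newBuilder[Int]
      var i = 0
      var j = 0
      var cross = 0L
      while (i < sl.length && j < sr.length) {
        if (sl(i) <= sr(j)) { merged += sl(i); i += 1 }
        else {
          // Every remaining element of sl is greater than sr(j), so this
          // single step accounts for (sl.length - i) inversions at once.
          cross += sl.length - i
          merged += sr(j)
          j += 1
        }
      }
      merged ++= sl.drop(i)
      merged ++= sr.drop(j)
      (merged.result(), leftInv + rightInv + cross)
    }

  // Inversion number divided by N choose 2: 0.0 for ascending, 1.0 for descending.
  def inversionRatio(xs: Vector[Int]): Double = {
    val n = xs.length.toLong
    if (n < 2) 0.0 else sortAndCount(xs)._2.toDouble / (n * (n - 1) / 2)
  }
}

For the example sequence above, Sortedness.inversionRatio(Vector(9, 5, 7, 6)) yields 4/6 ≈ 0.67.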

Approach 2 (Stochastic)

  • Randomly sample pairs (i,j), where i != j
  • For each pair, determine whether list[min(i,j)] > list[max(i,j)] (0 or 1)
  • Compute the average of these comparisons

I would personally go with the stochastic approach unless you have a requirement of exactness - if only because it's so easy to implement.
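
A rough sketch of the stochastic estimator in Scala (the sample count `samples` is an arbitrary choice here, and `estimateInversionRatio` is just an illustrative name):

import scala.util.Random

// Estimates the fraction of inverted pairs by random sampling.
// Result is near 0.0 for an ascending list and near 1.0 for a descending one.
def estimateInversionRatio(xs: IndexedSeq[Int],
                           samples: Int = 10000,
                           rng: Random = new Random): Double = {
  require(xs.length >= 2, "need at least two elements")
  val inverted = (1 to samples).count { _ =>
    val i = rng.nextInt(xs.length)
    var j = rng.nextInt(xs.length)
    while (j == i) j = rng.nextInt(xs.length) // ensure i != j
    xs(math.min(i, j)) > xs(math.max(i, j))   // true if this pair is an inversion
  }
  inverted.toDouble / samples
}

The estimate converges on the exact ratio as the sample count grows, and the cost is O(samples) regardless of the list length.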


If what you really want is a value (z') between -1 (sorted descending) and 1 (sorted ascending), you can simply map the value above (z), which is between 0 (sorted ascending) and 1 (sorted descending), to this range using this formula:

z' = -2 * z + 1
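
For example, for the sequence 9, 5, 7, 6 discussed above, z = 4/6 ≈ 0.67, so z' = -2 * 0.67 + 1 ≈ -0.33, i.e. the list is closer to descending order than to ascending order.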
Timothy Shields
  • 2
    It's kind of fascinating to me that sorting a list is (typically) O(n*logn), and the naive/obvious method of computing inversions is O(n^2). I wonder if there are better algorithms out there for computing the number of inversions? – Mark Bessey Jun 08 '13 at 01:48
  • 5
    There are a couple of interesting approaches in this SO question: http://stackoverflow.com/questions/6523712/calculating-the-number-of-inversions-in-a-permutation Basically, they amount to sorting the array in order to figure out how many inversions there are. – Mark Bessey Jun 08 '13 at 01:54
  • @MarkBessey Thanks for the link! I've incorporated it in the answer. – Timothy Shields Jun 08 '13 at 01:57
  • 4
    I naively thought you could just count adjacent pairs that are out of order. But that will severely undercount: 1 2 3 1 2 3 only has one adjacent inversion, but it's 50% inverted by the more correct measure. – Barmar Jun 11 '13 at 19:44
  • Great answer. Might it benefit from a precise definition of inversion? – Chris Calo Jun 11 '13 at 21:18
  • @ChristopherJamesCalo I pulled in the definition from the Wikipedia page I link to. – Timothy Shields Jun 11 '13 at 21:27
  • 1
    @TimothyShields, I meant define it in words. Also from Wikipedia: "In computer science and discrete mathematics, an inversion is a pair of places of a sequence where the elements on these places are out of their natural order." But maybe that's understood by this audience. I didn't know what it was. – Chris Calo Jun 12 '13 at 00:18
  • @ChristopherJamesCalo Is the following not a definition in words (albeit with some minimal math lingo)? -- Formally, let `A(1), A(2), ..., A(n)` be a sequence of `n` distinct numbers. If `i < j` and `A(i) > A(j)`, then the pair `(i,j)` is called an **inversion** of `A`. – Timothy Shields Jun 12 '13 at 00:26
  • 2
    @Barmar I think that list 1 2 3 1 2 3 would qualify as sorta sorted ;-) – scunliffe Jun 12 '13 at 01:30
  • 2
    @TimothyShields, well, no, it isn't. But I won't belabor the point. Just a suggestion to add a non-formal definition that is more accessible to the less symbolically inclined. – Chris Calo Jun 13 '13 at 02:11
  • @ChristopherJamesCalo Suggestion taken. :) – Timothy Shields Jun 13 '13 at 02:51
  • @Barmar The sequence `1, 2, 3, 1, 2, 3` has only `6` inversions, but `6 choose 2` is `15`. So it's actually 60% sorted, not 50%. – Timothy Shields Jun 24 '13 at 21:53
  • It seems like there might be another optimization approach where you move sequentially across the list in O(N), with each check asking if this is/is not inverted relative to the last. If you weighted agree/disagrees in the middle more highly than those towards the ends, and strong disagreements more highly than mild ones, you might get a similar answer to the O(N log N) approach. – Chris Moschini Jun 28 '13 at 23:39
  • @ChrisMoschini I invite you to try to come up with such a thing. :) I'll warn you that I have a pretty good hunch that computing the inversion number of a sequence using element-based comparisons is `Omega(n log n)` - meaning it *can't* be done with lower complexity - but it's not something I've proven. Computing the *list of inversions* (as opposed to the *inversion number*) is *definitely* `Omega(n log n)`. – Timothy Shields Jun 29 '13 at 16:07
  • @TimothyShields In the stochastic approach, need one prevent duplication of pairs or swapped pairs? (I'm sure there are formal terms for what I'm describing...) That is in the random testing, should one ensure that each pair tested is unique (beyond just ensuring that i != j for each pair). – ballenf Jan 14 '17 at 23:01
  • 2
    @BenFletcher Duplicate pairs don't present any problem. All that is important is to choose pairs independently from one another. For small lists, you can exhaustively look at every possible pair and still be very fast. For larger lists, the difference in convergence speed between including duplicates or not including duplicates is extremely negligible. – Timothy Shields Jan 14 '17 at 23:15
  • @TimothyShields Thank you for the quick response! I researched more to realize that my question of whether duplicates should be prevented is asking in statistical terms whether 'with replacement' vs. 'without replacement' was important. As you said, it seems that the difference is more important (or only important) with a small population size. But, generally, 'with replacement', i.e., allowing duplicates, is preferred when possible. It certainly makes implementation of this algorithm much easier. – ballenf Jan 15 '17 at 21:00
  • I'm not sure I understand normalizing by N choose 2 in the stochastic case. That means the larger the list, the smaller the maximum sortedness score. For example, the sortedness score of a reversed list of size 2 is 1, 3 is 0.33..., 4 is 0.16... since the unnormalized score is 1, which is then divided by 1, 3, or 6, respectively. If 1 is supposed to indicate a reversed list, shouldn't the score just be the average score of the samples? – elhefe Jun 22 '22 at 22:00
  • Another question - shouldn't the inequality be less than or equal to rather than less than? Less than counts identical entries as unsorted IIUC – elhefe Jun 22 '22 at 22:21
  • I needed an implementation so https://gist.github.com/MrCreosote/b72a5dd0248b7a601a2944501a0b1fc0 – elhefe Jun 22 '22 at 22:35
25

The traditional measure of the sortedness of a list (or other sequential structure) is the number of inversions.

The number of inversions is the number of pairs (a, b) such that a appears earlier in the list than b (the index of a is less than the index of b), and yet b << a. Here << represents whatever ordering relation you choose for your particular sort.

A fully sorted list has no inversions, and a completely reversed list has the maximum number of inversions.

Marcin
18

You can use actual correlation.

Suppose that you assign to each item an integer rank, starting from zero, given by its position in the sorted list. Note that for a sorted list, a graph of each element's position index versus its rank will look like dots on a straight line (a correlation of 1.0 between position and rank).

You can compute a correlation coefficient on this data. For a reverse-sorted list you will get -1, and so on.
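
A rough Scala sketch of this idea (assuming distinct elements so that ranks are unambiguous; `rankCorrelation` is just an illustrative name):

// Rank of each element = its position in the sorted list.
// Correlating ranks against original positions gives a value in [-1, 1]:
// 1.0 for an ascending list, -1.0 for a descending one.
def rankCorrelation(xs: Seq[Int]): Double = {
  require(xs.length >= 2, "need at least two elements")
  val rankOf = xs.sorted.zipWithIndex.toMap          // assumes distinct values
  val positions = xs.indices.map(_.toDouble)
  val ranks = xs.map(x => rankOf(x).toDouble)
  def mean(v: Seq[Double]) = v.sum / v.length
  val (mp, mr) = (mean(positions), mean(ranks))
  val cov = positions.zip(ranks).map { case (p, r) => (p - mp) * (r - mr) }.sum
  val denom = math.sqrt(positions.map(p => (p - mp) * (p - mp)).sum) *
              math.sqrt(ranks.map(r => (r - mr) * (r - mr)).sum)
  cov / denom
}

For instance, rankCorrelation(Vector(1, 3, 2, 4)) gives 0.8, while a fully reversed list gives -1.0.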

Kaz
  • 1
    I'm sorry, but this leaves too much unexplained, like how you assign the integers. – Marcin Jun 08 '13 at 00:56
  • 2
    You need the sorted list to assign the integers; then it is just an enumeration of the items. – Kaz Jun 08 '13 at 01:08
  • 1
    Exactly what I was going to suggest. Determine the correlation between the position of the object in the original list and its position in the sorted list. The bad news is that correlation routines probably run in O(n^2); the good news is they are probably off-the-shelf for your environment. – Peter Webb Jun 08 '13 at 09:41
  • 2
    Yeah, just Spearman's rho http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient – Lucas Jun 08 '13 at 14:52
  • I'm curious... is this approach equivalent to scaling the count of the number of inversions? – Clayton Stanley Jun 12 '13 at 08:10
7

There have been great answers, and I would like to add a mathematical aspect for completeness:

  • You can measure how sorted a list is by measuring how strongly it correlates with a sorted list. To do that, you can use a rank correlation (the best known being Spearman's), which is exactly the same as the usual correlation except that it uses the ranks of the elements in the list instead of their actual values.

  • Many extensions exist, all yielding a correlation coefficient (+1 for an exact sort, -1 for an exact inversion)

  • This gives the measure useful statistical properties, such as the permutational central limit theorem, which tells you the distribution of this measure for random lists.

meduz
3

Apart from the inversion count, for numeric lists the distance from the sorted state (here computed as the Euclidean distance) is imaginable:

#! ruby
# Euclidean distance between the list and its sorted version.
d = -> a { a.zip( a.sort ).map { |u, v| ( u - v ) ** 2 }.reduce( :+ ) ** 0.5 }

a = 8, 7, 3, 4, 10, 9, 6, 2, 5, 1
d.( a )              #=> 15.556
d.( a.sort )         #=> 0.0
d.( a.sort.reverse ) #=> 18.166 is the worst case
Boris Stitnicky
  • I think that's the square of the standard correlation function, see http://en.wikipedia.org/wiki/Correlation_ratio . And applies equally to non-numeric lists; the two values which are compared are the object's position in the two lists. – Peter Webb Jun 08 '13 at 09:44
  • I am a simpleton. I don't even know what correlation ratio is. When I read that Wikipedia article, right at the top, I'm asked to learn what "statistical dispersion" is, then "standard deviation", then "variation", then "interclass correlation coefficient". I learnt all of that, several times, and several times, I forgot it again. In this pragmatic answer of mine, I simply measure the distance between the two vectors with Pythagoras theorem, that I remember from the elementary school, that's all. – Boris Stitnicky Jun 08 '13 at 23:14
2

I am not sure of the "best" method, but a simple one would be to compare every element with the one after it, incrementing a counter if element2 > element1 (or whatever condition you want to test), and then divide by the number of adjacent pairs (one less than the number of elements). That gives you a percentage.
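
In Scala, a minimal sketch of this adjacent-pairs idea might look like the following (`adjacentSortedness` is just an illustrative name):

// Fraction of adjacent pairs that are in ascending order: 1.0 = fully sorted.
def adjacentSortedness(xs: Seq[Int]): Double =
  if (xs.length < 2) 1.0
  else xs.sliding(2).count { case Seq(a, b) => a <= b }.toDouble / (xs.length - 1)

As comments on similar answers point out, this only looks at adjacent pairs, so it undercounts disorder compared with the inversion-based measures above.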

user2369405
1

I would count the adjacent pairs that are in the right order and divide by the total number of comparisons. Here is a simple Python example.

my_list = [1, 4, 5, 6, 9, -1, 5, 3, 55, 11, 12, 13, 14]

right_comparison_count = 0

for i in range(len(my_list) - 1):
    if my_list[i] < my_list[i + 1]:  # Assume you want it in ascending order
        right_comparison_count += 1

if right_comparison_count == 0:
    result = -1  # No adjacent pair is in order: treat as fully reversed
else:
    result = float(right_comparison_count) / float(len(my_list) - 1)

print(result)
ibrahim
0

Take your list, calculate the ranks of the values in that list, and call the list of ranks Y. Let X be another list containing the integers from 1 to length(Y). You can then obtain exactly the measure of sortedness that you are looking for by calculating the correlation coefficient, r, between the two lists.

r = \frac{\sum ^n _{i=1}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum ^n _{i=1}(X_i - \bar{X})^2} \sqrt{\sum ^n _{i=1}(Y_i - \bar{Y})^2}} 

For a fully sorted list, r = 1.0; for a reverse-sorted list, r = -1.0; and r varies between these limits for intermediate degrees of sortedness.

A possible problem with this approach, depending on the application, is that calculating the rank of each item in the list is equivalent to sorting it, so it is an O(n log n) operation.

Simon
  • But that won't ignore the curve shape. If his array is sorted, but, say, contains values increasing exponentially, the correlation will be small where he wants it to be 1.0. – Lee Daniel Crocker Jun 08 '13 at 00:28
  • @LeeDanielCrocker: Yes, that's a good point. I've amended my answer to address this by taking ranks of the values. – Simon Jun 08 '13 at 00:30
0

How about something like this?

#!/usr/bin/python3

def sign(x, y):
   if x < y:
      return 1
   elif x > y:
      return -1
   else:
      return 0

def mean(list_):
   return float(sum(list_)) / float(len(list_))

def main():
   list_ = [ 1, 2, 3, 4, 6, 5, 7, 8 ]
   signs = []
   # this zip is pairing up element 0, 1, then 1, 2, then 2, 3, etc...
   for elem1, elem2 in zip(list_[:-1], list_[1:]):
      signs.append(sign(elem1, elem2))

   # This should print 1 for a sorted list, -1 for a list that is in reverse order
   # and 0 for a run of the same numbers, like all 4's
   print(mean(signs))

main()
dstromberg
  • 2
    This only counts adjacent inversions. If you look at the other answers you’ll see that this is insufficient. – Konrad Rudolph Jun 08 '13 at 11:56
  • 1
    @KonradRudolph: I think this answer satisfies the question as asked. The fact that other answers are more comprehensive doesn't mean that this one is insufficient; it depends on the OP's requirements. – LarsH Jun 08 '13 at 14:52
0

We could sort our list L1 with the criteria we want and produce a list L2, which would be our ideal ordering. Then we could calculate the Levenshtein distance between each pair of corresponding items in L2 and L1 and sum the distances. The further the sum is from zero, the more unsorted L1 is.