1

I have a lot of music and i want to rank them from least favorite to favorite (this will take many days). I would like to compare two music files at a time (2-way comparison). I saw some questions on algorithms with the fewest comparisons. But the catch is that (since it's a long process) i want to add new music to the collection and in that case i don't want to start over sorting everything (thus creating a lot more comparison steps).

Which algorithm has the least amount of comparisons while still allowing new elements to be added which need to be compared too?

I'm not interested in least amount of comparisons for just a few items. Let's say 1000 items minimum.

Bonus if the algorithm supports N-way comparison (where N > 2) in case i would like to compare pictures instead.

EDIT: comparing two songs are a manual process by listening to them (thus slowly), the sorting algorithm is needed to rank them in the fewest amount of comparisons

Flip
  • 4,778
  • 1
  • 34
  • 48
  • I guess it's not just a question of the sorting algorithm but especially of the data structure you use to store your music rankings. Choose a tree structure of your favor to increase insert performance compared to i. e. an array. – SparkFountain Jan 16 '20 at 13:14
  • If you are writing a computer program then worrying about this for 1,000 elements is unnecessary unless you need to keep all 1,000 songs in memory for some reason. Something about this post though makes me think you are doing this by hand. If you want to use a sorting algorithm you need to have well defined scores for each song BEFORE sorting and once you have that just use Excel or Google docs to sort if you don't want to write code. – gph Jan 16 '20 at 13:32
  • @gph That's mostly true, but you're assuming that songs are compared by each song having a score, and then comparing the scores (i.e. sorting using a "key" function). If the songs are compared by listening to both and deciding which one you prefer, then there is no "key" function and you are sorting using a "comparator" function, albeit one which depends on user input. – kaya3 Jan 16 '20 at 14:44
  • @gph "i want to rank them" is that i want to make the comparison between two songs by listening to the music. I need the sorting algorithm to do the ranking in the least amount of steps. After all songs are ranked i can define scores for them. – Flip Jan 16 '20 at 17:19
  • @kaya3 Agreed. The risk I was trying to sweep under the rug is when you rely on a subjective metric you could have A – gph Jan 16 '20 at 17:24
  • @Flip Ah, Ok. That's interesting...and IIUC much harder. You would need a separate algorithm to convert the relative preferences into a ranking, no? – gph Jan 16 '20 at 17:27
  • @gph Yes, that's true; but an algorithm which performs a minimal number of comparisons would never detect such a cycle, because it would never do the redundant comparison between A and C after already comparing A,B and B,C; so there would be no problem for the program, only for the user that sometimes there would be two songs "out of order" relative to each other (but there always would be, if the user's preferences aren't transitive). – kaya3 Jan 16 '20 at 17:28
  • @kaya3 yep, agreed. – gph Jan 16 '20 at 17:31
  • If you search in your browser for "sorting and searching algorithms", you'll find references that can explain this much better than we can manage here. – Prune Jan 16 '20 at 20:40

3 Answers3

4

There seem to be two stages in your problem. The first stage is to sort all of the songs you already have, and the second stage is to insert new songs, one-by-one, into the already-sorted order.


The first stage is what standard sorting algorithms do. In this stage, the input is an array presumed to be completely unordered, and all of the sorting is done at once. You want to do this using the minimum number of comparisons possible.

There is no perfect answer to this question; no known sorting algorithm uses a provably minimum number of comparisons for all inputs. Information theory gives n log₂ n - 1.443 n + O(log n) as a theoretical lower bound for the average number of comparisons, but this bound has not been achieved.

The currently-known sorting algorithms which get closest to the above bound are merge-insertion sort (also known as the Ford–Johnson algorithm), and variations of it. Merge-insertion sort performs on average approximately n log₂ n - 1.415 n comparisons, which is very close to the theoretical bound. For 1024 items, you'd probably be doing something like ~8,790 comparisons, where the theoretical bound is like ~8,760.

According to this other Stack Overflow answer as of December 2018, none of the algorithms which improve on merge-insertion sort are "freely documented", which I take to mean that these improved algorithms are only presented in academic papers. More public information is available for merge-insertion sort, and there is not much room for the variants to improve on it, so I would suggest going with this algorithm rather than wading through academic literature; unless your n is much larger, there is little to gain from it.


The second stage is a different problem than what sorting algorithms solve. In this stage, you need an "online" algorithm which allows adding new items into the current sorted order.

You cannot do this with fewer than ⌈log₂ (n + 1)⌉ comparisons per insertion, because there are n + 1 positions the new item could belong in the current order, and each comparison gives one bit of information.

The binary search algorithm works to find the correct position in a sorted array; or you could use a balanced binary search tree data structure. Either way, each insertion will be achieved using the optimal number of comparisons. The advantage of using a binary search tree is that insertion takes O(log n) time overall; inserting into a sorted array takes O(log n) comparisons but O(n) time to move other elements around in the array.

kaya3
  • 47,440
  • 4
  • 68
  • 97
  • binary search is one half of the puzzle. The other half is which sorting algorithm to use – Flip Jan 16 '20 at 17:22
  • To sort the songs which i already have. So i can answer questions like "show my top 10 songs" will still keeping open the ability to add new songs later (and not having to sort everything all over again). – Flip Jan 16 '20 at 18:12
  • I put a bounty. Maybe you can elaborate your answer more. – Flip Jan 26 '20 at 17:10
  • @Flip I've expanded the answer to cover the "first stage" where you sort all the songs you already have. Regarding N-way comparisons, I think that is too large a topic for me to cover adequately, and it's less well-researched than sorting with pairwise comparisons, but potentially you could sort your songs faster that way so hopefully somebody else will answer covering that option. – kaya3 Jan 26 '20 at 18:40
  • 1
    I think separating "stages" misses the point of the question: arrival of new items *during the many days sorting will take*. Alas, the question could spell out more explicitly if answers to queries as *current top nine* shall be cheap even before sorting is completed. – greybeard Jan 27 '20 at 08:26
  • @greybeard That's an interesting take and not one that I had considered. Merge-insertion sort allows for there to be an "odd element" in one half when a subarray is split; when an even-length sub-array is being sorted and a "new song" needs to be added, you could just pretend it was the "odd element" from that subarray. That only allows for logarithmically many new songs, but if there are a lot of new songs you can insert them in pairs; it's easier to insert them into unsorted subarrays, but you should also be able to insert them one each into two sorted subarrays that are about to be merged. – kaya3 Jan 27 '20 at 17:42
  • I like your answers because it's really well motivated on the quality of the algorithms. However like you said 1024 songs still need 8760 comparisons. But actually i have a lot more songs than 1024, so i will never be able to complete sorting. While theoretically optimal it's not an answer i can implement practically. – Flip Feb 01 '20 at 14:04
  • i can not edit my comment, i wanted to say: i can NOT perform the comparisons practically – Flip Feb 01 '20 at 14:11
  • In that case you will need to go the multi-way comparison route, or not sort the whole set (ok if you only want the top 10/top 20, e.g.). The latter can be done using a priority queue to keep the best 10 or 20 so far, the former is hard to say as much about since sorting with multi-way comparators is much less well-studied. – kaya3 Feb 01 '20 at 17:38
1

Assuming that your music library has no order to it, merge sort is the best sorting algorithm to use. Adding elements while merge sort is ongoing is not so easy though.

I think your best bet is a depth limited search tree like the 2-3 tree or the red-black tree. Personally I would suggest the 2-3 tree as the red black is a variant of it with less per node complexity but a worse minimum depth bound.

Using this tree you can simply start adding songs to it according to the rules clearly described on Wikipedia and every song you have added will be in sorted order. This has the added benefit that when inserting a song it will be compared multiple times in a row and thus it will be fresh in your memory so you might not need to listen to it for every comparison.

This method sorts your songs one at a time so if a new song comes along that you want to rank right away you can just add it before the rest of the unsorted songs.

You will probably need to make a program to assist you with maintaining the order and the tree structure. The only manual way I can think of is to use nested folders as nodes which makes adding to and rearranging the tree doable. It does make querying a bit of a hassle though, depending on what you want to do.

Arnaldur
  • 67
  • 9
  • I like your answer, it seems well thought out. However it doesn't motivate very well as to why this would be the optimal solutions. Other answers have given calculations and/or references to back up the argument. It's not clear to me with this answer how many comparisons i would have to do. – Flip Feb 01 '20 at 13:57
  • You would have to do at most n log n comparisons. Although like I mentioned with it being fresh in your memory, it would probably be closer to (n log n)/2. – Arnaldur Feb 04 '20 at 16:29
1

A non-comparative sorting algorithm, like radix sort, can sort data with 0 comparisons! These are not as generic as comparative sorting algorithms like merge or insertion sort, but can get far better runtime if your data meets the necessary requirements.

Essentially, if you have knowledge about the distribution of your data, you can sort faster than O(n log n). For instance, if you are sorting n numbers, and know that they are integers between 1 and N, you can use counting sort to sort them in O(n + N). You can iteratively add elements for O(1) as well.

Applying this to your problem of ranking music is more challenging (songs are not integers), but you can you do a variation of bucket sort where you first bin your music into, say, 10% "tiers": top 0-10%, 10-20%, 20-30%, ..., 90-100% (i.e., the bottom). Then you can either recursively apply bucket sort to those (top 0-1%, 1-2%, etc.) or apply standard sorting algorithms. Eventually, you'll need to do a standard comparison sort. This approach, compared with only using comparison sort, will reduce the number of comparisons by a factor of log(n)/log(n/B), where B is the number of buckets. For 100 buckets and 10000 songs, this is a factor of 2 reduction.

An alternative, comparison-saving approach is to do insertion sort (for both initial sorting and later insertions) with a modified binary search: instead of setting the initial bounds of the binary serach at 0 and n, set them to values based on your own intuition of where you are certain it will end up, like 0 and n/10, if it's definitely in your top 10%. The more granularly you can do this, the fewer comparisons you will need.

Caveat: with both bucket sort and the modified binary search, if you are wrong, you will need to do additional comparisons to fix your mistake.

And one final word: this question assumes that there exists is a correct ranking and that it can be achieved via comparisons. If you have circular preferences, such as a > b, b > c, and c > a, a la rock-paper-scissors, then a ranking cannot be constructed. The algorithms will still complete, but the resulting list will be inconsistent.

David Slater
  • 396
  • 2
  • 6
  • Thanks for your answer. I think this answer is the best because it would take me less time to first rate songs into buckets (from 1 to 10). And after that refine sorting more if i would like that. After a while bucket sorting gets impractical because i can not evaluate the beauty of a song on a very fine scale to say something like "i rate this song 623 out of 1000". Then like you say i can switch to insertion sort if i want to refine more. If i then add a song later rated "5" i can use binary search in the "5 bucket" to place it accurately if i like. I just say this to recap in my own words. – Flip Feb 01 '20 at 14:09