7

Given 3 lists which are arbitrarily sorted by the same but unknown sort order. Is there an algorithm that merges these lists into one which is then still sorted by the same order?

Example:

List1: a b c f h

List2: b c e h

List3: c d e f

Assume these lists are sorted but the sort order used is not known. I want to combine these lists to a result that does not contain duplicates but still maintains the sort order: a b c d e f h

As said above: It is known, that the given lists are sorted but it is not known by which order, but the requirement is that the merged list is still sorted by the same (unknown) order.

In the example above, I know that the element "f" is positioned between "e" and "h" because from List1 I know that

"c" < "f" < "h",

from List2 I know that

"c" < "e" < "h"

and from List3 I know that

"e" < "f" and "c" < "e"

which combines to:

"c" < "e" < "f" < "h"

If the sort order cannot be determined by any of the given lists, it is permissible to just append the element to the end of the result list. Also, if the sort order of a sequence of elements cannot be determined it is permissible to insert them in any order into the list as long as they are in the right place (e.g. if I know that "b" and "c" must be inserted between "a" and "d" but I don't know if it should be a b c d or a c b d, then both are permissible.)

Of course this is just an example. The real lists are longer (but contain less than 100 elements), contain not single but multiple character elements and the sort order is not alphabetic. Also, I have got up to 5 lists.

I need to implement this algorithm in Delphi (and no: This is not homework but a real life problem), but I take an algorithm in an language provided it does not contain too much compiler magic or complex library functions.

Performance is not much of an issue because this is done once.

dummzeuch
  • 10,975
  • 4
  • 51
  • 158
  • I have racked my brain and tried do remember my graph theory at Uni for an algorithm that might help. To no avail so far. – dummzeuch Oct 10 '13 at 15:05
  • Not. Sure. Put all elements into one list and sort those along the knowledge retrieved from/ according to the sort order in list1, list2, list3, ... one list after the other. If the sort order is consistent across all 5 lists then it does not matter in all other cases you will run into an error. The order of an element in the original list1 for example is only the relative position to the successor. –  Oct 10 '13 at 15:27
  • I considered that, too, @Michael, but most sorting algorithms assume a total ordering of the elements — they don't do well when the comparison of two elements can yield inconsistent results. Given two elements that don't have a relative order defined, how do you decide which comes first in a way that's consistent with the other data? – Rob Kennedy Oct 10 '13 at 15:33
  • No need the compare since the knowledge of the sort order in list1 is already included in the destination list. –  Oct 10 '13 at 15:37
  • add) No need the compare since the knowledge of the sort order in list1 is already included in the destination list. @Rob Kennedy Does it matter. My idea was to 'sort' according to list one, then list 2, then list 3, ... There is no way to say an absolute position. You only know from list1 that this element's position is before it's neighbor and from list2 you know it more specifically, ... . Since there is no natural order what shall you do. I will have to rethink this for a while. –  Oct 10 '13 at 15:42
  • Why is not simply merging the lists being considered if they are all sorted? – Glenn1234 Oct 11 '13 at 01:12
  • how do you combine the lists "a b c f" and "a b c h" ? – Pieter B Oct 11 '13 at 06:54

5 Answers5

9

Your input lists define a partial order of your items. According to an answer at Math.SE, what you want is a topological sort. Algorithms are described on Wikipedia.

Community
  • 1
  • 1
Rob Kennedy
  • 161,384
  • 21
  • 275
  • 467
5

Nice question. Although topological sort may be the most recommended method, you would have to parse the input first to build up the dependencies list. I thought of a more directed approach, based on finding the items that occur in multiple lists for setting up the order definition.

I am unable to predict anything on time complexity, but since you do not care about performance and especially considering the total number of items being 500 at maximum, I think this algorithm should work nicely.

Algorithm

  • All lists are put together into a temporary list which then is naturally sorted in order to identify and sift out all duplicate items. Those duplicates, named Keys, constitute the sole definition of the final sort order.
  • The Key list is sorted in the input sorting order by comparing every two items: if the two Keys occur in the same input list, then the first of them in that list comes in front of the second in the output list too. If two keys do not occur simultaneously in any input list, then they are considered equal.
  • Subsequently, a loop cycles over the Keys.
  • On every cycle, within each input list each item between the previous Key and the current Key is added to the output list. A cycle ends with the addition of the current Key.

Implementation

type
  TSorterStringList = class(TStringList)
  protected
    Id: Integer;
    KeyId: Integer;
    function Current: String;
  public
    constructor Create;
  end;

  TSorterStringLists = class(TObjectList)
  private
    function GetItem(Index: Integer): TSorterStringList;
  public
    property Items[Index: Integer]: TSorterStringList read GetItem; default;
  end;

  TSorter = class(TObject)
  private
    FInput: TSorterStringLists;
    FKeys: TStringList;
    procedure GenerateKeys;
    function IsKey(const S: String): Boolean;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Sort(Output: TStrings);
    property Input: TSorterStringLists read FInput;
  end;

{ TSorterStringList }

constructor TSorterStringList.Create;
begin
  inherited Create;
  KeyId := -1;
end;

function TSorterStringList.Current: String;
begin
  Result := Strings[Id];
end;

{ TSorterStringLists }

function TSorterStringLists.GetItem(Index: Integer): TSorterStringList;
begin
  if Index >= Count then
    Count := Index + 1;
  if inherited Items[Index] = nil then
    inherited Items[Index] := TSorterStringList.Create;
  Result := TSorterStringList(inherited Items[Index]);
end;

{ TSorter }

constructor TSorter.Create;
begin
  inherited Create;
  FInput := TSorterStringLists.Create(True);
  FKeys := TStringList.Create;
end;

destructor TSorter.Destroy;
begin
  FKeys.Free;
  FInput.Free;
  inherited Destroy;
end;

threadvar
  CurrentSorter: TSorter;

function CompareKeys(List: TStringList; Index1, Index2: Integer): Integer;
var
  Input: TSorterStringLists;
  I: Integer;
  J: Integer;
  K: Integer;
begin
  Result := 0;
  Input := CurrentSorter.Input;
  for I := 0 to Input.Count - 1 do
  begin
    J := Input[I].IndexOf(List[Index1]);
    K := Input[I].IndexOf(List[Index2]);
    if (J > - 1) and (K > -1) then
    begin
      Result := J - K;
      Break;
    end;
  end;
end;

procedure TSorter.GenerateKeys;
var
  All: TStringList;
  I: Integer;
begin
  All := TStringList.Create;
  try
    All.Sorted := True;
    All.Duplicates := dupAccept;
    for I := 0 to FInput.Count - 1 do
      All.AddStrings(FInput[I]);
    for I := 0 to All.Count - 2 do
      if (All[I] = All[I + 1]) then
        if (FKeys.Count = 0) or (FKeys[FKeys.Count - 1] <> All[I]) then
          FKeys.Add(All[I]);
  finally
    All.Free;
  end;
  CurrentSorter := Self;
  FKeys.CustomSort(CompareKeys);
end;

function TSorter.IsKey(const S: String): Boolean;
begin
  Result := FKeys.IndexOf(S) > -1;
end;

procedure TSorter.Sort(Output: TStrings);
var
  KeyId: Integer;
  I: Integer;
  List: TSorterStringList;
begin
  if FInput.Count = 0 then
    Exit;
  Output.BeginUpdate;
  try
    GenerateKeys;
    for KeyId := -1 to FKeys.Count - 1 do
    begin
      for I := 0 to FInput.Count - 1 do
      begin
        List := FInput[I];
        if List.KeyId <= KeyId then
          while (List.Id < List.Count) and not IsKey(List.Current) do
          begin
            Output.Add(List.Current);
            Inc(List.Id);
          end;
        while (List.Id < List.Count) and IsKey(List.Current) do
        begin
          List.KeyId := FKeys.IndexOf(List.Current);
          Inc(List.Id);
        end;
      end;
      if KeyId + 1 < FKeys.Count then
        Output.Add(FKeys[KeyId + 1]);
    end;
  finally
    Output.EndUpdate;
  end;
end;

Example usage

procedure TForm1.Button1Click(Sender: TObject);
var
  Sorter: TSorter;
begin
  Sorter := TSorter.Create;
  try
    Sorter.Input[0].CommaText := '1, 2, 4, 9, 10, 11, 22, 46, 48, 51, 70, 72';
    Sorter.Input[1].CommaText := '3, 9, 23, 43, 44, 45, 47, 48, 51, 71, 90, 91';
    Sorter.Input[2].CommaText := '0, 3, 4, 7, 8, 11, 23, 50, 51, 52, 55, 70';
    Sorter.Input[3].CommaText := '2, 6, 14, 15, 36, 37, 38, 39, 51, 65, 66, 77';
    Sorter.Input[4].CommaText := '5, 27, 120, 130';
    ListBox1.Items.Assign(Sorter.Input[0]);
    ListBox2.Items.Assign(Sorter.Input[1]);
    ListBox3.Items.Assign(Sorter.Input[2]);
    ListBox4.Items.Assign(Sorter.Input[3]);
    ListBox5.Items.Assign(Sorter.Input[4]);
    Sorter.Sort(ListBox6.Items);
    // Results in:
    // 1, 0, 5, 27, 120, 130, 3, 2, 6, 14, 15, 36, 37, 38, 39, 4, 7, 8, 9, 10,
    // 11, 22, 46, 23, 43, 44, 45, 47, 50, 48, 51, 71, 90, 91, 52, 55, 65, 66,
    // 77, 70, 72
  finally
    Sorter.Free;
  end;
end;
Community
  • 1
  • 1
NGLN
  • 43,011
  • 8
  • 105
  • 200
  • 3
    I haven't actually read and understood your code yet, but it worked with the test data I have thrown at it so far. – dummzeuch Oct 11 '13 at 10:25
2

So you have

List1: a b c f h
List2: b c e h
List3: c d e f

Go list by list and enter into a graph. So after the first list you have:

A -> B -> C -> F -> H

Then you start on list 2. B is already in there. Then you see that B connects to C which you already know. Then you know that C connects to E , which isn't in there yet so you now have:

A -> B -> C -> F -> H
          |          
          E

You then know that E connects to H so:

A -> B -> C -> F -> H
          |         ^ 
          E --------|

Then you go to list 3. You know that C is in there and it points to D:

          D
          ^
          |
A -> B -> C -> F -> H
          |         ^ 
          E --------|

Then you know that D points to E. Since C->E has the same ancestor as C -> D -> E, you can break the link from C-> E, thus you now have:

          D -> E ---|
          ^         |
          |         |
A -> B -> C -> F -> H

And finally you know E comes before F. Since you knew that E led directly to H before, and now there's another path to H from E (E->F->H) you know that F must be between E and H and you can drop the link from E -> H. Thus you now have:

          D -> E
          ^    |     
          |    |     
A -> B -> C -> F -> H

Which you then know can be shortened to

A -> B -> C -> D -> E -> F -> H

Now let's suppose you had ended up with something like:

  E -> T
  |    |
  A -> Z
  |    ^
  R -> W

You wouldn't have enough information to tell if E/T comes before R/W, but you know that both come before Z and after A. Thus, you'd just take one of the paths at random, then the next, etc, so you could end up with either A-E-T-R-W-Z or A-R-W-E-T-Z. You could even take one at random from each path, which would guarantee that those legs would still be sorted and maybe you'll get lucky that your merge was sorted as well. So you could have A-R-E-W-T-Z which still has E/T relatively sorted and R/W still relatively sorted or if you were to start with the E leg you'd get lucky and have A-E-R-T-W-Z

FuriousGeorge
  • 4,561
  • 5
  • 29
  • 52
1

Graph theory seems like a good first intuition.

You could build a directed graph where your list's elements are the vertices and you insert a directed edge from each list element to its successor. Then a node A is less-than another node B if and only if B can be reached from A by traversing the graph.

A cycle in the graph (A is less than B and B is less than A) indicates either a corruption of the input data or the presence of two equivalent elements with different names.

In the absence of cycles, merging under the given less-than relation should be straightforward: Repeatedly remove the nodes that cannot be reached by any other node from the graph and add them to your output list.

ComicSansMS
  • 51,484
  • 14
  • 155
  • 166
0

Can you use a hash table? Here's an algorithm for merging two lists like that.

T = new HashMap
for(i = 1 to length B)
  T.put(B[i],i)
N = array[length A]
for(i = 1 to length A){
  if(T containsKey A[i])
    N[i] = T.get(A[i])
  else
    N[i] = -1
}

R = array[length A + length B]
j = 1
k = 1
for(i = 1 to length A){
  if(N[i] = -1)
    R[j++] = N[i]
  else{
    while(k <= N[i])
      R[j++] = B[k++]
  }
}
while(k <= length B)
  R[j++] = B[k++]
return R[1 ... j-1]

The elements A[i] where N[i]>0 match elements of B, the others will be placed in a valid order. There might be a bug in there, but this is the general idea.

For merging three arrays, you could merge the first two, and then merge the third into the merged array. On edit, this last sentence is false, as pointed out by @RobKennedy. You can probably change the algorithm to handle three lists, but it's not quite as simple, and so you might want to go with topological sort instead.

mrip
  • 14,913
  • 4
  • 40
  • 58
  • Suppose the inputs are {a}, {b}, and {b,a}. If you do as you suggest in your final paragraph and merge the first two lists to create a temporary list that you then merge with the third one, you might get {a,b}, which contradicts the third input. You can't just apply the algorithm incrementally or you risk generating false constraints; you need to process all the input at once. – Rob Kennedy Oct 10 '13 at 17:00