Sorting a list to maximize similarity between neighboring items

Question

I have a list of keys:

['A', 'B', 'C']

For each of those keys there's a list of properties:

{
    'A': [2,3],
    'B': [1,2],
    'C': [4]
}

I wish to sort the list of labels such that neighboring labels share as many of the properties as possible.

In the above example A and B share the relation 2, so they should be next to each other - whereas C shares nothing with them, so it can go anywhere.

So the possible orders for this example would be as follows:

["A","B","C"] # acceptable
["A","C","B"] # NOT acceptable
["B","A","C"] # acceptable
["B","C","A"] # NOT acceptable
["C","A","B"] # acceptable
["C","B","A"] # acceptable

Buckets

Actually I would prefer this to be represented by putting them into "buckets":

[["A", "B"], ["C"]] # this can represent all four possible orders above.

However, this gets problematic if a label belongs to two different buckets:

{
    'A': [2,3],
    'B': [1,2],
    'C': [1,4]
}

How would I represent that?

I could put it like this:

[["A", "B"], ["C", "B"]]

But then I need another processing step to turn the list of buckets into the final representation:

["A", "B", "C"]

And above that there could be recursively nested buckets:

[[["A","B"], ["C"]], ["D"]]

And then these could overlap:

[[["A","B"], ["C"]], ["A","D"]]

Quality

The "closeness", i.e. quality of a solution is defined as the sum of the intersection of relations between neighbors (the higher the quality the better):

def measurequality(result,mapping):
    lastKey = None
    quality = 0
    for key in result:
        if lastKey is None:
            lastKey = key
            continue
        quality += len(set(mapping[key]).intersection(mapping[lastKey]))
        lastKey = key
    return quality

# Example determining that the solution ['A', 'B', 'C'] has quality 1:
#measurequality(['A', 'B', 'C'],
#    {
#        'A': [2,3],
#        'B': [1,2],
#        'C': [4]
#    })

Brute-Forcing

Brute-forcing does not constitute a solution (in practice the list contains on the order of several thousand elements - though, if anyone got a brute-forcing approach that is better than O(n²)...).

However, using brute-forcing to create additional test cases is possible:

produce a list L of n items ['A','B','C',...]
produce for each item a dictionary R of relations (up to n random numbers between 0 and n should be sufficient).
produce all possible permutations of L and feed them together with R into measurequality() and keep those with maximal return value (might not be unique).

Code for creating random testcases to test the implementation:

import string
import random

def randomtestcase(n):
    keys=list(string.ascii_uppercase[0:n])

    minq = 0
    maxq = 0
    while minq == maxq:
        items={}
        for key in keys:
            items[key] = random.sample(range(1,10),int(random.random()*10))

        minq = n*n
        minl = list(keys)
        maxq = 0
        maxl = list(keys)
        for _ in range(0, 1000): # TODO: explicitly construct all possible permutations of keys.
            random.shuffle(keys)
            q = measurequality(keys,items)
            if q < minq:
                minq = q
                minl = list(keys)
            if maxq < q:
                maxq = q
                maxl = list(keys)

    return ( items, minl, maxq )

( items, keys, quality ) = randomtestcase(5)

sortedkeys = dosomething( keys, items )
actualquality = measurequality( sortedkeys, items )
if actualquality < quality:
    print('Suboptimal: quality {0} < {1}'.format(actualquality,quality))

Attempt

One of the many "solutions" that didn't work (very broken, this one doesn't have the selection of initial element / choice between prepending and appending to the result list that I had in others):

def dosomething(keys,items):
    result = []
    todo = list(keys)
    result.append(todo.pop())
    while any(todo):
        lastItems = set(items[result[-1]])
        bestScore = None
        bestKey = None
        for key in todo:
            score = set(items[key]).intersection(lastItems)
            if bestScore is None or bestScore < score:
                bestScore = score
                bestKey = key
        todo.remove(bestKey)
        result.append(bestKey)
    return result

Examples

(Also check out the example generator in the section Brute-Forcing above.)

Testing code trying some examples:

def test(description,acceptable,keys,arguments):
    actual = dosomething(keys,arguments)
    if "".join(actual) in acceptable:
        return 0
    print("\n[{0}] {1}".format("".join(keys),description))
    print("Expected: {0}\nBut was: {1}".format(acceptable,actual))
    print("Quality of result: {0}".format(measurequality(actual,arguments)))
    print("Quality of expected: {0}".format([measurequality(a,arguments) for a in acceptable]))
    return 1

print("EXAMPLES")
failures = 0

# Need to try each possible ordering of letters to ensure that the order of keys
# wasn't accidentially already a valid ordering.
for keys in [
        ["A","B","C"],
        ["A","C","B"],
        ["B","A","C"],
        ["B","C","A"],
        ["C","A","B"],
        ["C","B","A"]
    ]:
    failures += test(
        "1. A and B both have 2, C doesn't, so C can go first or last but not in between.",
        ["ABC", "BAC", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [4]
        })

    failures += test(
        "2. They all have 2, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [2]
        })

    failures += test(
        "3. A and B share 2, B and C share 1, so B must be in the middle.",
        ["ABC", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [1]
        })

    failures += test(
        "4. Each shares something with each other, creating a cycle, so they can show up in any order.",
        ["ABC", "ACB", "BAC", "BCA", "CAB", "CBA"],
        keys,
        {
            "A": [2,3],
            "B": [1,2],
            "C": [1,3]
        })

if 0 < failures:
    print("{0} FAILURES".format(failures))

Precedence

As it was asked: the numbers used for the relations aren't in an order of precedence. An order of precedence exists, but it's a partial order and not the one of the numbers. I just didn't mention it because it makes the problem harder.

So given this example:

{
    'A': [2,3],
    'B': [1,2],
    'C': [4]
}

Might be replaced by the following (using letters instead of numbers and adding precedence information):

{
    'A': [('Q',7),('R',5)],
    'B': [('P',6),('Q',6)],
    'C': [('S',5)]
}

Note that

The precedence is meaningful only within a list, not across lists.
The precedence of shared relations might be different between two lists.
Within a list there might be the same precedence several times.

Since you mentioned "possible orders", this sounds more a problem of grouping, instead of sorting? Are elements in the list expected to be unique? e.g. will `['A', 'B', 'A', 'C', 'D', 'B']` be a possible case, and how would that manifest as a result? You can only apply a comparison score against pairs, so how would pairs be identified? `['A', 'B', 'C', 'D']` would create 6 possible pairs of `[('A', 'B'), ('A', 'C'), ('A', 'D'), ('B', 'C'), ('B', 'D'), ('C', 'D')]`, and which type of pairs are to be prioritized first? The requirement feels a bit unclear to me. — r.ook, May 11 '20 at 22:07
Is the list short enough that brute-forcing all permutations is acceptable? — Samwise, May 11 '20 at 22:07
Ultimately I want the sorted list. But it would be really useful to have the bucket list as an intermediate step because it splits the problem. — user66554, May 11 '20 at 22:10
this sounds like you're trying to solve the Traveling Salesman problem (https://en.wikipedia.org/wiki/Travelling_salesman_problem), where your "distance" between points for your problem is some function of how many "properties" any two items share — CrepeGoat, May 11 '20 at 22:12
When brute forcing the buckets would of course be a complication. But while brute forcing works to check whether the algorithm produces the correct result, the list size in production is several thousand elements (each of which already is a sublist for performance reasons). — user66554, May 11 '20 at 22:13
The list needs to be unique, and the comparison is between neighbors. So in [A,B,C,D] comparisons are done AB, BC and CD. If all pairs would be considered then there wouldn't be a point, the score would always be the same. — user66554, May 11 '20 at 22:23
I was afraid it might end up NP. The original problem should have been solvable in n², but nobody could tell me how, so I had to "reduce" it to this. — user66554, May 11 '20 at 22:25
It's a mix between sorting and grouping - that's why I'd like an intermediate "bucket" result: putting them into buckets is the grouping step. Then sorting is applied - both sorting the buckets and sorting the contents of each bucket. The point is that the contents of the buckets may not mix. — user66554, May 11 '20 at 22:28
So for a list of `ABCD`, there wouldn't be a valid comparison between `BD`, `AD`, and `AC`, is that so? So really you are sorting in pairs based on existing order? — r.ook, May 11 '20 at 22:29
@user66554 two things: what's the "original problem"? and how do you know it *should* be solvable in n^2; is this a homework problem? — CrepeGoat, May 11 '20 at 22:35
The original problem is rather complicated - this sub-problem is about grouping nodes of a DAG based on common ancestors and descendants. If it were a homework problem then someone would know the solution, so it would be way less frustrating. — user66554, May 11 '20 at 22:48
oh these groups are derived from a DAG? that's actually not necessarily a negligible detail; it means the groups have some underlying order to them that you wouldn't get from just randomly generated groups. can you go into more detail on the DAG bit? for example one question: do the numbers ("properties") represent the nodes that make up a given node's ancestry (either strictly parents, or strictly children, or all of the above)? or do they represent something else? — CrepeGoat, May 12 '20 at 03:32
It's descendants who have ancestors who aren't descendants. Though it might be anything else, that's the point of separating out the problem. And even if not, the edges in a graph can be chosen such that any arbitrary set of numbers results. — user66554, May 12 '20 at 06:48
for 'bucket' manipulations check out the `more_itertools` library, particularly `flatten()` and `collapse()` methods. I think there is a good case for learning about and using iterators and generators for solving this very difficult problem. Otherwise your memory usage and execution time quickly ramp up — Nic, Jun 20 '20 at 18:56

Nic · Accepted Answer · 2020-06-10T23:20:22.803

This is a Travelling Salesman Problem, a notoriously hard problem to solve optimally. The code presented solves for 10,000 nodes with simple interconnections (i.e. one or two relations each) in about 15 minutes. It performs less well for sequences that are more richly interconnected. This is explored in the test results below.

The idea of precedence, mentioned by the OP, is not explored.

The presented code consists of a heuristic solution, a brute-force solution that is optimal but not practical for large node_sets, and some simple but scalable test data generators, some with known optimal solutions. The heuristic finds optimal solutions for the OP's 'ABC' example, my own 8-item node_set, and for scalable test data for which optimal solutions are known.

If the performance is not good enough, at least it is a bit of a first effort and has the beginnings of a test 'workshop' to improve the heuristic.

Test Results

>>> python sortbylinks.py

===============================
"ABC" (sequence length 3)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('A', 'B', 'C'), ('C', 'A', 'B'), ('B', 'A', 'C')], ...and more
Top score: 1
"optimise_order" function took 0.0s
Optimal quality: 1

===============================
"Nick 8-digit" (sequence length 8)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('A', 'E', 'F', 'C', 'D', 'G', 'B', 'H')], ...and more
Top score: 6
"optimise_order" function took 0.0s
Optimal quality: 6

Short, relatively trivial cases appear to be no problem.

===============================
"Quality <1000, cycling subsequence, small number of relations" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AAAC', 'AAAL', 'AAAU', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991

===============================
"Quality <1000, cycling subsequence, small number of relations, shuffled" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AADF', 'AAKM', 'AAOZ', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991

The "Quality <1000, cycling subsequence" (sequence length 501) is interesting. By grouping nodes with {0, 1} relation sets the quality score can be nearly doubled. The heuristic finds this optimal sequence. Quality 1000 is not quite possible because these double-linked groups need attaching to each other via a single-linked node every so often (e.g. {'AA': {0, 1}, 'AB': {0, 1}, ..., 'AZ': {0, 1}, <...single link here...> 'BA': {1, 2}, 'BB': {1, 2}, ...}).

Performance is still good for this test data with few relations per node.

"Quality 400, single unbroken chain, initial solution is optimal" (sequence length 401)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400

===============================
"Quality 400, single unbroken chain, shuffled" (sequence length 401)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400

One of the difficulties with Travelling Salesman Problems (TSPs) is knowing when you have an optimal solution. The heuristic doesn't seem to converge any faster even from a near-optimal or optimal start.

===============================
"10,000 items, single unbroken chain, initial order is optimal" (sequence length 10001)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AOUQ', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 10002
"optimise_order" function took 947.0s
Optimal quality: 10000

When there are very small numbers of relations, even if there are many nodes, performance is pretty good and the results may be close to optimal.

===============================
"Many random relations per node (0..n), n=200" (sequence length 200)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAEO', 'AAHC', 'AAHQ', ...), ...], ...and more
Top score: 6861
"optimise_order" function took 94.0s
Optimal quality: ?

===============================
"Many random relations per node (0..n), n=500" (sequence length 500)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAJT', 'AAHU', 'AABT', ...), ...], ...and more
Top score: 41999
"optimise_order" function took 4202.0s
Optimal quality: ?

This is more like the data generated by the OP, and also more like the classical Travelling Salesman Problem (TSP) where you have a set of distances between each city pair (for 'city' read 'node') and nodes are typically richly connected to each other. In this case the links between nodes is partial-- there is no guarantee of a link between any 2 nodes.

The heuristic's time performance is much worse in cases like this. There are between 0 and n random relations for each node, for n nodes. This is likely to mean many more swap combinations yield improved quality, swaps and quality checks are more expensive in themselves and many more passes will be needed before the heuristic converges on its best result. This may mean O(n^3) in the worst case.

Performance degrades as the number of nodes and relations increases (note the difference between n=200-- 3 minutes-- and n=500-- 70 minutes.) So currently the heuristic may not be practical for several thousand richly-interconnected nodes.

In addition, the quality of the result for this test can't be known precisely because a brute-force solution is not computationally feasible. 6861 / 200 = 34.3 and 41999 / 500 = 84.0 average connections between node pairs doesn't look too far off.

Code for the Heuristic and Brute Force Solvers

import sys
from collections import deque
import itertools
import random
import datetime

# TODO: in-place swapping? (avoid creating copies of sequences)


def timing(f):
    """
    decorator for displaying execution time for a method
    """
    def wrap(*args):
        start = datetime.datetime.now()
        f_return_value = f(*args)
        end = datetime.datetime.now()
        print('"{:s}" function took {:.1f}s'.format(f.__name__, (end-start).seconds))
        return f_return_value
    return wrap


def flatten(a):
    # e.g. [a, [b, c], d] -> [a, b, c, d]
    return itertools.chain.from_iterable(a)


class LinkAnalysis:
    def __init__(self, node_set, max_ram=100_000_000, generate_seeds=True):
        """
        :param node_set: node_ids and their relation sets to be arranged in sequence
        :param max_ram: estimated maximum RAM to use
        :param generate_seeds: if true, attempt to generate some initial candidates based on sorting
        """
        self.node_set = node_set
        self.candidates = {}
        self.generate_seeds = generate_seeds
        self.seeds = {}
        self.relations = []
        # balance performance and RAM using regular 'weeding'
        candidate_size = sys.getsizeof(deque(self.node_set.keys()))
        self.weed_interval = max_ram // candidate_size

    def create_initial_candidates(self):
        print('Working...')
        self.generate_seed_from_presented_sequence()
        if self.generate_seeds:
            self.generate_seed_candidates()

    def generate_seed_from_presented_sequence(self):
        """
        add initially presented order of nodes as one of the seed candidates
        this is worthwhile because the initial order may be close to optimal
        """
        presented_sequence = self.presented_sequence()
        self.seeds[tuple(self.presented_sequence())] = self.quality(presented_sequence)

    def presented_sequence(self) -> list:
        return list(self.node_set.keys())  # relies on Python 3.6+ to preserve key order in dicts

    def generate_seed_candidates(self):
        initial_sequence = self.presented_sequence()
        # get relations from the original data structure
        relations = sorted(set(flatten(self.node_set.values())))
        # sort by lowest precedence relation first
        print('...choosing seed candidates')
        for relation in reversed(relations):
            # use true-false ordering: in Python, True > False
            initial_sequence.sort(key=lambda sortkey: not relation in self.node_set[sortkey])
            sq = self.quality(initial_sequence)
            self.seeds[tuple(initial_sequence)] = sq

    def quality(self, sequence):
        """
        calculate quality of full sequence
        :param sequence:
        :return: quality score (int)
        """
        pairs = zip(sequence[:-1], sequence[1:])
        scores = [len(self.node_set[a].intersection(self.node_set[b]))
                  for a, b in pairs]
        return sum(scores)

    def brute_force_candidates(self, sequence):
        for sequence in itertools.permutations(sequence):
            yield sequence, self.quality(sequence)

    def heuristic_candidates(self, seed_sequence):
        # look for solutions with higher quality scores by swapping elements
        # start with large distances between elements
        # then reduce by power of 2 until swapping next-door neighbours
        max_distance = len(seed_sequence) // 2
        max_pow2 = int(pow(max_distance, 1/2))
        distances = [int(pow(2, r)) for r in reversed(range(max_pow2 + 1))]
        for distance in distances:
            yield from self.seed_and_variations(seed_sequence, distance)
        # seed candidate may be better than its derived sequences -- include it as a candidate
        yield seed_sequence, self.quality(seed_sequence)

    def seed_and_variations(self, seed_sequence, distance=1):
        # swap elements at a distance, starting from beginning and end of the
        # sequence in seed_sequence
        candidate_count = 0
        for pos1 in range(len(seed_sequence) - distance):
            pos2 = pos1 + distance
            q = self.quality(seed_sequence)
            # from beginning of sequence
            yield self.swap_and_quality(seed_sequence, q, pos1, pos2)
            # from end of sequence
            yield self.swap_and_quality(seed_sequence, q, -pos1, -pos2)
            candidate_count += 2
            if candidate_count > self.weed_interval:
                self.weed()
                candidate_count = 0

    def swap_and_quality(self, sequence, preswap_sequence_q: int, pos1: int, pos2: int) -> (tuple, int):
        """
        swap and return quality (which can easily be calculated from present quality
        :param sequence: as for swap
        :param pos1: as for swap
        :param pos2: as for swap
        :param preswap_sequence_q: quality of pre-swapped sequence
        :return: swapped sequence, quality of swapped sequence
        """
        initial_node_q = sum(self.node_quality(sequence, pos) for pos in [pos1, pos2])
        swapped_sequence = self.swap(sequence, pos1, pos2)
        swapped_node_q = sum(self.node_quality(swapped_sequence, pos) for pos in [pos1, pos2])
        qdelta = swapped_node_q - initial_node_q
        swapped_sequence_q = preswap_sequence_q + qdelta
        return swapped_sequence, swapped_sequence_q

    def swap(self, sequence, node_pos1: int, node_pos2: int):
        """
        deques perform better than lists for swapping elements in a long sequence
        :param sequence-- sequence on which to perform the element swap
        :param node_pos1-- zero-based position of first element
        :param pos2--: zero-based position of second element
        >>> swap(('A', 'B', 'C'), 0, 1)
        ('B', 'A', 'C')
        """
        if type(sequence) is tuple:
            # sequence is a candidate (which are dict keys and hence tuples)
            # needs converting to a list for swap processing
            sequence = list(sequence)
        if node_pos1 == node_pos2:
            return sequence
        tmp = sequence[node_pos1]
        sequence[node_pos1] = sequence[node_pos2]
        sequence[node_pos2] = tmp
        return sequence

    def node_quality(self, sequence, pos):
        if pos < 0:
            pos = len(sequence) + pos
        no_of_links = 0
        middle_node_relations = self.node_set[sequence[pos]]
        if pos > 0:
            left_node_relations = self.node_set[sequence[pos - 1]]
            no_of_links += len(left_node_relations.intersection(middle_node_relations))
        if pos < len(sequence) - 1:
            right_node_relations = self.node_set[sequence[pos + 1]]
            no_of_links += len(middle_node_relations.intersection(right_node_relations))
        return no_of_links

    @timing
    def optimise_order(self, selection_strategy):
        top_score = 0
        new_top_score = True
        self.candidates.update(self.seeds)
        while new_top_score:
            top_score = max(self.candidates.values())
            new_top_score = False
            initial_candidates = {name for name, score in self.candidates.items() if score == top_score}
            for initial_candidate in initial_candidates:
                for candidate, q in selection_strategy(initial_candidate):
                    if q > top_score:
                        new_top_score = True
                        top_score = q
                        self.candidates[tuple(candidate)] = q
                self.weed()
        print(f"Number of candidates with top score: {len(list(self.candidates))}")
        print(f"{list(self.candidates)[:3]}, ...and more")
        print(f"Top score: {top_score}")

    def weed(self):
        # retain only top-scoring candidates
        top_score = max(self.candidates.values())
        low_scorers = {k for k, v in self.candidates.items() if v < top_score}
        for low_scorer in low_scorers:
            del self.candidates[low_scorer]

Code Glossary

node_set: a set of labelled nodes of the form 'unique_node_id': {relation1, relation2, ..., relationN}. The set of relations for each node can contain either no relations or an arbitrary number.

node: a key-value pair consisting of a node_id (key) and set of relations (value)

relation: as used by the OP, this is a number. If two nodes both share relation 1 and they are neighbours in the sequence, it adds 1 to the quality of the sequence.

sequence: an ordered set of node ids (e.g. ['A', 'B', 'C'] that is associated with a quality score. The quality score is the sum of shared relations between nodes in the sequence. The output of the heuristic is the sequence or sequences with the highest quality score.

candidate: a sequence that is currently being investigated to see if it is of high quality.

Method

generate seed sequences by stable sorting on the presence or absence of each relation in a linked item
The initially presented order is also one of the seed sequences in case it is close to optimal
For each seed sequence, pairs of nodes are swapped looking for a higher quality score
Execute a "round" for each seed sequence. A round is a shellsort-like pass over the sequence, swapping pairs of nodes, at first at a distance, then narrowing the distance until there is a distance of 1 (swapping immediate neighbours.) Keep only those sequences whose quality is more than the current top quality score
If a new highest quality score was found in this round, weed out all but top-score candidates and repeat 4 using top scorers as seeds. Otherwise exit.

Tests and Test Data Generators

The heuristic has been tested using small node_sets, scaled data of a few hundred up to 10,000 nodes with very simple relationships, and a randomised, richly interconnected node_set more like the OP's test data generator. Perfect single-linked sequences, linked cycles (small subsequences that link within themselves, and to each other) and shuffling have been useful to pick up and fix weaknesses.

ABC_links = {
    'A': {2, 3},
    'B': {1, 2},
    'C': {4}
}

nick_links = {
    'B': {1, 2, 4},
    'C': {4},
    'A': {2, 3},
    'D': {4},
    'E': {3},
    'F': {5, 6},
    'G': {2, 4},
    'H': {1},
}

unbroken_chain_linked_tail_to_head = ({1, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}, {10, 1})
cycling_unbroken_chain_linked_tail_to_head = itertools.cycle(unbroken_chain_linked_tail_to_head)


def many_nodes_many_relations(node_count):
    # data set with n nodes and random 0..n relations as per OP's requirement
    relation_range = list(range(node_count))
    relation_set = (
        set(random.choices(relation_range, k=random.randint(0, node_count)))
        for _ in range(sys.maxsize)
    )
    return scaled_data(node_count, relation_set)


def scaled_data(no_of_items, link_sequence_model):
    uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    # unique labels using sequence of four letters (AAAA, AAAB, AAAC, .., AABA, AABB, ...)
    item_names = (''.join(letters) for letters in itertools.product(*([uppercase] * 4)))
    # only use a copy of the original link sequence model-- otherwise the model could be exhausted
    # or start mid-cycle
    # https://stackoverflow.com/questions/42132731/how-to-create-a-copy-of-python-iterator
    link_sequence_model, link_sequence = itertools.tee(link_sequence_model)
    return {item_name: links for _, item_name, links in zip(range(no_of_items), item_names, link_sequence)}


def shuffled(items_list):
    """relies on Python 3.6+ dictionary insertion-ordered keys"""
    shuffled_keys = list(items_list.keys())
    random.shuffle(shuffled_keys)
    return {k: items_list[k] for k in shuffled_keys}


cycling_quality_1000 = scaled_data(501, cycling_unbroken_chain_linked_tail_to_head)
cycling_quality_1000_shuffled = shuffled(cycling_quality_1000)

linked_forward_sequence = ({n, n + 1} for n in range(sys.maxsize))
# { 'A': {0, 1}, 'B': {1, 2}, ... } links A to B to ...
optimal_single_linked_unbroken_chain = scaled_data(401, linked_forward_sequence)
shuffled_single_linked_unbroken_chain = shuffled(optimal_single_linked_unbroken_chain)

large_node_set = scaled_data(10001, cycling_unbroken_chain_linked_tail_to_head)
large_node_set_shuffled = shuffled(large_node_set)


tests = [
    ('ABC', 1, ABC_links, True),
    ('Nick 8-digit', 6, nick_links, True),
    # ('Quality <1000, cycling subsequence, small number of relations', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000, True),
    # ('Quality <1000, cycling subsequence, small number of relations, shuffled', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000_shuffled, True),
    ('Many random relations per node (0..n), n=200', '?', many_nodes_many_relations(200), True),
    # ('Quality 400, single unbroken chain, initial solution is optimal', 400, optimal_single_linked_unbroken_chain, False),
    # ('Quality 400, single unbroken chain, shuffled', 400, shuffled_single_linked_unbroken_chain, True),
    # ('10,000 items, single unbroken chain, initial order is optimal', 10000, large_node_set, False),
    # ('10,000 items, single unbroken chain, shuffled', 10000, large_node_set_shuffled, True),
]
for title, expected_quality, item_links, generate_seeds in tests:
    la = LinkAnalysis(node_set=item_links, generate_seeds=generate_seeds)
    seq_length = len(list(la.node_set.keys()))
    print()
    print('===============================')
    print(f'"{title}" (sequence length {seq_length})')
    print('===============================')
    la.create_initial_candidates()
    print('Finding heuristic candidates...')
    la.optimise_order(la.heuristic_candidates)
    print(f'Optimal quality: {expected_quality}')
    # print('Brute Force working...')
    # la.optimise_order(la.brute_force_candidates)

Performance

The heuristic is more 'practical' than the brute force solution because it leaves out many possible combinations. It may be that a lower-quality sequence produced by an element swap is actually one step away from a much higher-quality score after one more swap, but such a case might be weeded out before it could be tested.

The heuristic appears to find optimal results for single-linked sequences or cyclical sequences linked head to tail. These have a known optimal solution and the heuristic finds that solution and they may be less complex and problematic than real data.

A big improvement came with the introduction of an "incremental" quality calculation which can quickly calculate the quality difference a two-element swap makes without recomputing the quality score for the entire sequence.

The heuristic is unsatisfactory: if your result `['B', 'H', 'A', 'G', 'E', 'C', 'D', 'F']` is fed to `measurequality` it returns `3`, but the result `['F', 'H', 'C', 'D', 'G', 'B', 'A', 'E']` gets `6`. (I have to believe `measurequality` here, I tried checking it by hand but failed.) — user66554, May 13 '20 at 08:22
Hmm interesting. Some of the intermediate results have quality 4. Back to the drawing board :) — Nic, May 13 '20 at 20:16
The heuristic now finds optimal solutions for my test data and yours. There is a test data generator for testing the heuristic on larger numbers of items. — Nic, May 14 '20 at 20:59
@user66554 some improvements just added, including some better test data. If any of these responses, mine or anyone else's, seem a genuine answer to you, please mark the preferred answer — Nic, May 27 '20 at 21:07

score 0 · Answer 2 · answered May 12 '20 at 19:09

I was tinkering with your test program and came up with this solution, which gives me 0 failures. Feels like a heuristic though, it needs definitely more testing and test cases. The function assumes that the keys are unique, so no ['A', 'A', 'B', ...] lists and in the arguments dictionary are all elements present:

def dosomething(_, arguments):

    m = {}
    for k, v in arguments.items():
        for i in v:
            m.setdefault(i, []).append(k)

    out, seen = [], set()
    for _, values in sorted(m.items(), key=lambda k: -len(k[1])):
        for v in values:
            if v not in seen:
                out.append(v)
                seen.add(v)

    return out

I added information about how to produce additional test cases. I just tested your answer with Nick's example like I did for his solution, and it yielded `4` instead of `6`. A heuristic considering only `len` on a single item won't work, `intersection` (or something similar) will have to show up somewhere. — user66554, May 13 '20 at 08:28

Cireo · Answer 3 · 2020-06-01T04:35:43.393

0

EDIT: Misread Quality function, this maps to separable traveling salesman problems

For N nodes, P properties, and T total properties across all nodes, this should be able to be solved in O(N + P + T) or better, depending on the topology of the data.

Lets convert your problem to a graph, where the "distance" between any two nodes is -(number of shared properties). Nodes with no connections would be left unlinked. This will take at least O(T) to create the graph, and perhaps another O(N + P) to segment.

Your "sort order" is then translated to a "path" through the nodes. In particular, you want the shortest path.

Additionally, you will be able to apply several translations to improve the performance and usability of generic algorithms:

Segment graph into disconnected chunks and solve each of them independently
Renormalize all the values to start at 1..N instead of -N..-1 (per graph, but doesn't really matter, could add |number of properties| instead).

https://en.wikipedia.org/wiki/Component_(graph_theory)#Algorithms

It is straightforward to compute the components of a graph in linear time (in terms of the numbers of the vertices and edges of the graph).

https://en.wikipedia.org/wiki/Shortest_path_problem#Undirected_graphs

Weights  Time complexity   Author
ℝ+       O(V2)             Dijkstra 1959
ℝ+       O((E + V) log V)  Johnson 1977 (binary heap)
ℝ+       O(E + V log V)    Fredman & Tarjan 1984 (Fibonacci heap)
ℕ        O(E)              Thorup 1999 (requires constant-time multiplication).

edited Jun 01 '20 at 04:35

answered May 14 '20 at 21:32

Cireo

4,197
1
19
24

Is this a graph problem or something more like scheduling? There is a linear sequence to optimise using the resources of each node (the relations). Notice that the nodes that have no link to other nodes in the "graph", that are not in the graph, are also a part of the solution. How does a shortest-path algorithm help you with that? – Nic May 15 '20 at 07:31
Separated nodes or entire subgraphs (which there may not be many of in real data) can be reassembled in arbitrary order after being solved individually. By construction anything that would matter if they are next to has already been accounted for. – Cireo May 15 '20 at 08:43
ok I'm not a specialist in this area, but it seems to me this problem is more like the Travelling Salesman problem where the higher the 'quality' of the connection between adjacent vertices the closer they are to each other. And 'The travelling salesman problem is the problem of finding the shortest path that goes through every vertex exactly once, and returns to the start'. Is this a mistaken view? – Nic May 29 '20 at 06:19
Sure, traveling salesman is a superset of this problem (it is a superset of "find the max of a list" too), but it is vastly more complex. Nodes can't be "more adjacent", they are just connected or not, which drastically simplifies the problem. – Cireo May 29 '20 at 08:05
Surely vertices are "more adjacent" if they share a larger number of relations with their neighbour (see the Quality function in the OP's post.) The number of connecting relations is an inverse of the "distance" in the travelling salesman problem. The more relations, the "closer" they are to each other. If this is overkill or superset, I would be very pleased to understand why – Nic Jun 01 '20 at 02:37
1

Ah, I either read an old edit or misread the quality function - you are correct. This is an exact mapping of traveling salesman (after the graph separation). – Cireo Jun 01 '20 at 04:34