This is a Travelling Salesman Problem, a notoriously hard problem to solve optimally. The code presented solves for 10,000 nodes with simple interconnections (i.e. one or two relations each) in about 15 minutes. It performs less well for sequences that are more richly interconnected. This is explored in the test results below.
The idea of precedence, mentioned by the OP, is not explored.
The presented code consists of a heuristic solution, a brute-force solution that is optimal but not practical for large node_set
s, and some simple but scalable test data generators, some with known optimal solutions. The heuristic finds optimal solutions for the OP's 'ABC' example, my own 8-item node_set
, and for scalable test data for which optimal solutions are known.
If the performance is not good enough, at least it is a bit of a first effort and has the beginnings of a test 'workshop' to improve the heuristic.
Test Results
>>> python sortbylinks.py
===============================
"ABC" (sequence length 3)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('A', 'B', 'C'), ('C', 'A', 'B'), ('B', 'A', 'C')], ...and more
Top score: 1
"optimise_order" function took 0.0s
Optimal quality: 1
===============================
"Nick 8-digit" (sequence length 8)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('A', 'E', 'F', 'C', 'D', 'G', 'B', 'H')], ...and more
Top score: 6
"optimise_order" function took 0.0s
Optimal quality: 6
Short, relatively trivial cases appear to be no problem.
===============================
"Quality <1000, cycling subsequence, small number of relations" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AAAC', 'AAAL', 'AAAU', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991
===============================
"Quality <1000, cycling subsequence, small number of relations, shuffled" (sequence length 501)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 3
[('AADF', 'AAKM', 'AAOZ', ...), ...], ...and more
Top score: 991
"optimise_order" function took 2.0s
Optimal quality: 991
The "Quality <1000, cycling subsequence" (sequence length 501)
is interesting. By grouping nodes with {0, 1}
relation sets the quality score can be nearly doubled. The heuristic finds this optimal sequence. Quality 1000 is not quite possible because these double-linked groups need attaching to each other via a single-linked node every so often (e.g. {'AA': {0, 1}, 'AB': {0, 1}, ..., 'AZ': {0, 1}, <...single link here...> 'BA': {1, 2}, 'BB': {1, 2}, ...}
).
Performance is still good for this test data with few relations per node.
"Quality 400, single unbroken chain, initial solution is optimal" (sequence length 401)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400
===============================
"Quality 400, single unbroken chain, shuffled" (sequence length 401)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAAA', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 400
"optimise_order" function took 0.0s
Optimal quality: 400
One of the difficulties with Travelling Salesman Problems (TSPs) is knowing when you have an optimal solution. The heuristic doesn't seem to converge any faster even from a near-optimal or optimal start.
===============================
"10,000 items, single unbroken chain, initial order is optimal" (sequence length 10001)
===============================
Working...
Finding heuristic candidates...
Number of candidates with top score: 1
[('AOUQ', 'AAAB', 'AAAC', ...), ...], ...and more
Top score: 10002
"optimise_order" function took 947.0s
Optimal quality: 10000
When there are very small numbers of relations, even if there are many nodes, performance is pretty good and the results may be close to optimal.
===============================
"Many random relations per node (0..n), n=200" (sequence length 200)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAEO', 'AAHC', 'AAHQ', ...), ...], ...and more
Top score: 6861
"optimise_order" function took 94.0s
Optimal quality: ?
===============================
"Many random relations per node (0..n), n=500" (sequence length 500)
===============================
Working...
...choosing seed candidates
Finding heuristic candidates...
Number of candidates with top score: 1
[('AAJT', 'AAHU', 'AABT', ...), ...], ...and more
Top score: 41999
"optimise_order" function took 4202.0s
Optimal quality: ?
This is more like the data generated by the OP, and also more like the classical Travelling Salesman Problem (TSP) where you have a set of distances between each city pair (for 'city' read 'node') and nodes are typically richly connected to each other. In this case the links between nodes is partial-- there is no guarantee of a link between any 2 nodes.
The heuristic's time performance is much worse in cases like this. There are between 0 and n
random relations for each node, for n
nodes. This is likely to mean many more swap combinations yield improved quality, swaps and quality checks are more expensive in themselves and many more passes will be needed before the heuristic converges on its best result. This may mean O(n^3) in the worst case.
Performance degrades as the number of nodes and relations increases (note the difference between n=200
-- 3 minutes-- and n=500
-- 70 minutes.) So currently the heuristic may not be practical for several thousand richly-interconnected nodes.
In addition, the quality of the result for this test can't be known precisely because a brute-force solution is not computationally feasible. 6861 / 200 = 34.3
and 41999 / 500 = 84.0
average connections between node pairs doesn't look too far off.
Code for the Heuristic and Brute Force Solvers
import sys
from collections import deque
import itertools
import random
import datetime
# TODO: in-place swapping? (avoid creating copies of sequences)
def timing(f):
"""
decorator for displaying execution time for a method
"""
def wrap(*args):
start = datetime.datetime.now()
f_return_value = f(*args)
end = datetime.datetime.now()
print('"{:s}" function took {:.1f}s'.format(f.__name__, (end-start).seconds))
return f_return_value
return wrap
def flatten(a):
# e.g. [a, [b, c], d] -> [a, b, c, d]
return itertools.chain.from_iterable(a)
class LinkAnalysis:
def __init__(self, node_set, max_ram=100_000_000, generate_seeds=True):
"""
:param node_set: node_ids and their relation sets to be arranged in sequence
:param max_ram: estimated maximum RAM to use
:param generate_seeds: if true, attempt to generate some initial candidates based on sorting
"""
self.node_set = node_set
self.candidates = {}
self.generate_seeds = generate_seeds
self.seeds = {}
self.relations = []
# balance performance and RAM using regular 'weeding'
candidate_size = sys.getsizeof(deque(self.node_set.keys()))
self.weed_interval = max_ram // candidate_size
def create_initial_candidates(self):
print('Working...')
self.generate_seed_from_presented_sequence()
if self.generate_seeds:
self.generate_seed_candidates()
def generate_seed_from_presented_sequence(self):
"""
add initially presented order of nodes as one of the seed candidates
this is worthwhile because the initial order may be close to optimal
"""
presented_sequence = self.presented_sequence()
self.seeds[tuple(self.presented_sequence())] = self.quality(presented_sequence)
def presented_sequence(self) -> list:
return list(self.node_set.keys()) # relies on Python 3.6+ to preserve key order in dicts
def generate_seed_candidates(self):
initial_sequence = self.presented_sequence()
# get relations from the original data structure
relations = sorted(set(flatten(self.node_set.values())))
# sort by lowest precedence relation first
print('...choosing seed candidates')
for relation in reversed(relations):
# use true-false ordering: in Python, True > False
initial_sequence.sort(key=lambda sortkey: not relation in self.node_set[sortkey])
sq = self.quality(initial_sequence)
self.seeds[tuple(initial_sequence)] = sq
def quality(self, sequence):
"""
calculate quality of full sequence
:param sequence:
:return: quality score (int)
"""
pairs = zip(sequence[:-1], sequence[1:])
scores = [len(self.node_set[a].intersection(self.node_set[b]))
for a, b in pairs]
return sum(scores)
def brute_force_candidates(self, sequence):
for sequence in itertools.permutations(sequence):
yield sequence, self.quality(sequence)
def heuristic_candidates(self, seed_sequence):
# look for solutions with higher quality scores by swapping elements
# start with large distances between elements
# then reduce by power of 2 until swapping next-door neighbours
max_distance = len(seed_sequence) // 2
max_pow2 = int(pow(max_distance, 1/2))
distances = [int(pow(2, r)) for r in reversed(range(max_pow2 + 1))]
for distance in distances:
yield from self.seed_and_variations(seed_sequence, distance)
# seed candidate may be better than its derived sequences -- include it as a candidate
yield seed_sequence, self.quality(seed_sequence)
def seed_and_variations(self, seed_sequence, distance=1):
# swap elements at a distance, starting from beginning and end of the
# sequence in seed_sequence
candidate_count = 0
for pos1 in range(len(seed_sequence) - distance):
pos2 = pos1 + distance
q = self.quality(seed_sequence)
# from beginning of sequence
yield self.swap_and_quality(seed_sequence, q, pos1, pos2)
# from end of sequence
yield self.swap_and_quality(seed_sequence, q, -pos1, -pos2)
candidate_count += 2
if candidate_count > self.weed_interval:
self.weed()
candidate_count = 0
def swap_and_quality(self, sequence, preswap_sequence_q: int, pos1: int, pos2: int) -> (tuple, int):
"""
swap and return quality (which can easily be calculated from present quality
:param sequence: as for swap
:param pos1: as for swap
:param pos2: as for swap
:param preswap_sequence_q: quality of pre-swapped sequence
:return: swapped sequence, quality of swapped sequence
"""
initial_node_q = sum(self.node_quality(sequence, pos) for pos in [pos1, pos2])
swapped_sequence = self.swap(sequence, pos1, pos2)
swapped_node_q = sum(self.node_quality(swapped_sequence, pos) for pos in [pos1, pos2])
qdelta = swapped_node_q - initial_node_q
swapped_sequence_q = preswap_sequence_q + qdelta
return swapped_sequence, swapped_sequence_q
def swap(self, sequence, node_pos1: int, node_pos2: int):
"""
deques perform better than lists for swapping elements in a long sequence
:param sequence-- sequence on which to perform the element swap
:param node_pos1-- zero-based position of first element
:param pos2--: zero-based position of second element
>>> swap(('A', 'B', 'C'), 0, 1)
('B', 'A', 'C')
"""
if type(sequence) is tuple:
# sequence is a candidate (which are dict keys and hence tuples)
# needs converting to a list for swap processing
sequence = list(sequence)
if node_pos1 == node_pos2:
return sequence
tmp = sequence[node_pos1]
sequence[node_pos1] = sequence[node_pos2]
sequence[node_pos2] = tmp
return sequence
def node_quality(self, sequence, pos):
if pos < 0:
pos = len(sequence) + pos
no_of_links = 0
middle_node_relations = self.node_set[sequence[pos]]
if pos > 0:
left_node_relations = self.node_set[sequence[pos - 1]]
no_of_links += len(left_node_relations.intersection(middle_node_relations))
if pos < len(sequence) - 1:
right_node_relations = self.node_set[sequence[pos + 1]]
no_of_links += len(middle_node_relations.intersection(right_node_relations))
return no_of_links
@timing
def optimise_order(self, selection_strategy):
top_score = 0
new_top_score = True
self.candidates.update(self.seeds)
while new_top_score:
top_score = max(self.candidates.values())
new_top_score = False
initial_candidates = {name for name, score in self.candidates.items() if score == top_score}
for initial_candidate in initial_candidates:
for candidate, q in selection_strategy(initial_candidate):
if q > top_score:
new_top_score = True
top_score = q
self.candidates[tuple(candidate)] = q
self.weed()
print(f"Number of candidates with top score: {len(list(self.candidates))}")
print(f"{list(self.candidates)[:3]}, ...and more")
print(f"Top score: {top_score}")
def weed(self):
# retain only top-scoring candidates
top_score = max(self.candidates.values())
low_scorers = {k for k, v in self.candidates.items() if v < top_score}
for low_scorer in low_scorers:
del self.candidates[low_scorer]
Code Glossary
node_set
: a set of labelled nodes of the form 'unique_node_id': {relation1, relation2, ..., relationN}
. The set of relations for each node can contain either no relations or an arbitrary number.
node
: a key-value pair consisting of a node_id
(key) and set of relations (value)
relation
: as used by the OP, this is a number. If two nodes both share relation 1
and they are neighbours in the sequence, it adds 1
to the quality of the sequence.
sequence
: an ordered set of node ids (e.g. ['A', 'B', 'C'] that is associated with a quality
score. The quality score is the sum of shared relations between nodes in the sequence. The output of the heuristic is the sequence or sequences with the highest quality score.
candidate
: a sequence that is currently being investigated to see if it is of high quality.
Method
generate seed sequences by stable sorting on the presence or absence of each relation in a linked item
The initially presented order is also one of the seed sequences in case it is close to optimal
For each seed sequence, pairs of nodes are swapped looking for a higher quality score
Execute a "round" for each seed sequence. A round is a shellsort-like pass over the sequence, swapping pairs of nodes, at first at a distance, then narrowing the distance until there is a distance of 1 (swapping immediate neighbours.) Keep only those sequences whose quality is more than the current top quality score
If a new highest quality score was found in this round, weed out all but top-score candidates and repeat 4 using top scorers as seeds. Otherwise exit.
Tests and Test Data Generators
The heuristic has been tested using small node_sets
, scaled data of a few hundred up to 10,000 nodes with very simple relationships, and a randomised, richly interconnected node_set
more like the OP's test data generator. Perfect single-linked sequences, linked cycles (small subsequences that link within themselves, and to each other) and shuffling have been useful to pick up and fix weaknesses.
ABC_links = {
'A': {2, 3},
'B': {1, 2},
'C': {4}
}
nick_links = {
'B': {1, 2, 4},
'C': {4},
'A': {2, 3},
'D': {4},
'E': {3},
'F': {5, 6},
'G': {2, 4},
'H': {1},
}
unbroken_chain_linked_tail_to_head = ({1, 3}, {3, 4}, {4, 5}, {5, 6}, {6, 7}, {7, 8}, {8, 9}, {9, 10}, {10, 1})
cycling_unbroken_chain_linked_tail_to_head = itertools.cycle(unbroken_chain_linked_tail_to_head)
def many_nodes_many_relations(node_count):
# data set with n nodes and random 0..n relations as per OP's requirement
relation_range = list(range(node_count))
relation_set = (
set(random.choices(relation_range, k=random.randint(0, node_count)))
for _ in range(sys.maxsize)
)
return scaled_data(node_count, relation_set)
def scaled_data(no_of_items, link_sequence_model):
uppercase = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
# unique labels using sequence of four letters (AAAA, AAAB, AAAC, .., AABA, AABB, ...)
item_names = (''.join(letters) for letters in itertools.product(*([uppercase] * 4)))
# only use a copy of the original link sequence model-- otherwise the model could be exhausted
# or start mid-cycle
# https://stackoverflow.com/questions/42132731/how-to-create-a-copy-of-python-iterator
link_sequence_model, link_sequence = itertools.tee(link_sequence_model)
return {item_name: links for _, item_name, links in zip(range(no_of_items), item_names, link_sequence)}
def shuffled(items_list):
"""relies on Python 3.6+ dictionary insertion-ordered keys"""
shuffled_keys = list(items_list.keys())
random.shuffle(shuffled_keys)
return {k: items_list[k] for k in shuffled_keys}
cycling_quality_1000 = scaled_data(501, cycling_unbroken_chain_linked_tail_to_head)
cycling_quality_1000_shuffled = shuffled(cycling_quality_1000)
linked_forward_sequence = ({n, n + 1} for n in range(sys.maxsize))
# { 'A': {0, 1}, 'B': {1, 2}, ... } links A to B to ...
optimal_single_linked_unbroken_chain = scaled_data(401, linked_forward_sequence)
shuffled_single_linked_unbroken_chain = shuffled(optimal_single_linked_unbroken_chain)
large_node_set = scaled_data(10001, cycling_unbroken_chain_linked_tail_to_head)
large_node_set_shuffled = shuffled(large_node_set)
tests = [
('ABC', 1, ABC_links, True),
('Nick 8-digit', 6, nick_links, True),
# ('Quality <1000, cycling subsequence, small number of relations', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000, True),
# ('Quality <1000, cycling subsequence, small number of relations, shuffled', 1000 - len(unbroken_chain_linked_tail_to_head), cycling_quality_1000_shuffled, True),
('Many random relations per node (0..n), n=200', '?', many_nodes_many_relations(200), True),
# ('Quality 400, single unbroken chain, initial solution is optimal', 400, optimal_single_linked_unbroken_chain, False),
# ('Quality 400, single unbroken chain, shuffled', 400, shuffled_single_linked_unbroken_chain, True),
# ('10,000 items, single unbroken chain, initial order is optimal', 10000, large_node_set, False),
# ('10,000 items, single unbroken chain, shuffled', 10000, large_node_set_shuffled, True),
]
for title, expected_quality, item_links, generate_seeds in tests:
la = LinkAnalysis(node_set=item_links, generate_seeds=generate_seeds)
seq_length = len(list(la.node_set.keys()))
print()
print('===============================')
print(f'"{title}" (sequence length {seq_length})')
print('===============================')
la.create_initial_candidates()
print('Finding heuristic candidates...')
la.optimise_order(la.heuristic_candidates)
print(f'Optimal quality: {expected_quality}')
# print('Brute Force working...')
# la.optimise_order(la.brute_force_candidates)
Performance
The heuristic is more 'practical' than the brute force solution because it leaves out many possible combinations. It may be that a lower-quality sequence produced by an element swap is actually one step away from a much higher-quality score after one more swap, but such a case might be weeded out before it could be tested.
The heuristic appears to find optimal results for single-linked sequences or cyclical sequences linked head to tail. These have a known optimal solution and the heuristic finds that solution and they may be less complex and problematic than real data.
A big improvement came with the introduction of an "incremental" quality calculation which can quickly calculate the quality difference a two-element swap makes without recomputing the quality score for the entire sequence.