1

Say I have an undirected circular sequence that looks like this:

  1 —— 2 —— 3
 /           \
1             1
|             |
3             2
 \           /
  3 —— 2 —— 3

Say I have 3 sequences as below, represented by lists of numbers:

seq1 = [1,1,3,3,2,3,2,1,3,2] # anticlockwise from top left
seq2 = [3,2,3,3,1,1,2,3,1,2] # clockwise from bottom right
seq3 = [3,1,2,3,2,3,3,1,1,2] # clockwise from top right

Since the sequence is directionless, all 3 sequences are essentially identical, and represents the circular sequence above. In reality, I have thousands of these undirected circular sequences, so it is impossible to compare every pair of them. Therefore, I want to create a unique identifier that can represent each unique undirected circular sequence. For example, the identifier should be the same for the 3 sequences above.

My idea is to treat this type of sequences as circular graphs. Then I can assign edge weights as the differences between the two connected nodes, and find the path that traverses all nodes while maximizing the sum of all edge weights. Below is my Python implementation:

def identifier(seq):
    delta_sum = float('-inf')
    res_seq = []
    for i in range(len(seq)):
        new_seq = seq[i:] + seq[:i]
        ds = sum([new_seq[j+1] - new_seq[j] for j in range(len(seq)-1)])
        if ds > delta_sum:
            delta_sum = ds
            res_seq = new_seq
        if -ds > delta_sum:
            delta_sum = -ds
            res_seq = new_seq[::-1]
    return ','.join(map(str, res_seq))

print(identifier(seq1))
print(identifier(seq2))
print(identifier(seq3))

Output:

1,1,2,3,1,2,3,2,3,3
1,1,2,3,1,2,3,2,3,3
1,2,3,2,3,3,1,1,2,3

Clearly my algorithm isn't working. It creates the same identifier for the first two sequences, but creates a different one for the 3rd sequence. Can anyone suggest a relatively fast algorithm (preferably Python code) that can create a unique identifier for this kind of sequences?

Below are some related questions, but not exactly what I want to achieve:

How to check whether two lists are circularly identical in Python

Fast way to compare cyclical data

Shaun Han
  • 2,676
  • 2
  • 9
  • 29
  • 1
    What's wrong with the second thread you linked, using lexicographically minimal string rotations? If the issue is just that your strings are reversible, you can just use the minimum rotation of the original or reversed string. – kcsquared Oct 21 '21 at 00:07
  • I think this might belong more to https://cs.stackexchange.com/questions/tagged/algorithms as it is basically a hashing method for circular graphs, isn't it? – Jonathan Scholbach Oct 21 '21 at 00:10
  • @kcsquared It only works for directed sequences – Shaun Han Oct 21 '21 at 00:12
  • Yes, I address that in the second part of my comment. Your 'undirected sequences' are just equivalence classes on ordinary strings under reversal and cyclic rotations. What's the problem with running the LMSR algorithm once on the sequence in clockwise order, once in anticlockwise order, and taking the minimum of the two as your identifier? – kcsquared Oct 21 '21 at 00:19
  • @kcsquared what if they are equal? – Shaun Han Oct 21 '21 at 00:21
  • If two equal-length strings are equal in lexicographic order, they're the same string – kcsquared Oct 21 '21 at 00:24
  • @kcsquared I am referring to the LMSR. What if the LMSR is the same for clockwise and anticlockwise? Which one do I take? – Shaun Han Oct 21 '21 at 00:26
  • Maybe I'm not understanding the question. LMSR is an algorithm that takes a string and produces a string. You store the LMSR output, which is a string, as the identifier (or hash it, etc). If the LMSR algorithm produces two identical string outputs, you can store either LMSR, since they're identical strings. – kcsquared Oct 21 '21 at 00:33
  • @kcsquared Sorry. I am actually not familiar with LMSR so maybe it's me not understanding LMSR. I will try what you suggested. Thanks. – Shaun Han Oct 21 '21 at 00:49
  • How many nodes in each graph? Are they all the same length? – wim Oct 21 '21 at 01:01

1 Answers1

3

You could use tuples as hashable identifiers and pick the smallest one from the possible rotations of the sequence:

def identifier(s):
    return min((*s[i::d],*s[:i:d]) for d in (1,-1) for i in range(len(s)))

Output:

seq1 = [1,1,3,3,2,3,2,1,3,2] # anticlockwise from top left
seq2 = [3,2,3,3,1,1,2,3,1,2] # clockwise from bottom right
seq3 = [3,1,2,3,2,3,3,1,1,2] # clockwise from top right

print(identifier(seq1))
print(identifier(seq2))
print(identifier(seq3))
(1, 1, 2, 3, 1, 2, 3, 2, 3, 3)
(1, 1, 2, 3, 1, 2, 3, 2, 3, 3)
(1, 1, 2, 3, 1, 2, 3, 2, 3, 3)

Given that the smallest tuple will start with the smallest value, you can optimize this a bit by first finding the minimum value and only comparing tuples that are formed by starting from the minimum value indexes:

def identifier(seq):
    start  = min(seq)
    starts = [i for i,v in enumerate(seq) if v == start]
    return min((*seq[i::d],*seq[:i:d]) for d in (1,-1) for i in starts)
Shaun Han
  • 2,676
  • 2
  • 9
  • 29
Alain T.
  • 40,517
  • 4
  • 31
  • 51
  • This is O(n^2), right? Might be good enough if the graphs are short. – wim Oct 21 '21 at 02:14
  • Depends on the frequency of repetition of the smallest value in the sequences, but yes the worst case is O(n^2). If the average frequency for the minimum values is 3 per sequence, then it's T(3n) or O(n x f) – Alain T. Oct 21 '21 at 02:16
  • @AlainT. I like your answer. It looks very neat. The "optimized" version, however, is not correct. You can try ``seq1 = [1,1,1,1,1,2,1,1,1,2,2]`` and ``seq2 = [1,1,1,2,1,1,1,1,1,2,2]``. It returns ``(1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2)`` and ``(1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2)``, respectively. Your first function works fine tho. – Shaun Han Oct 21 '21 at 07:58
  • My bad, I was off by 1 going backward from a specific position (didn't matter when I was going through all positions but with the optimization it does). Fixed now. – Alain T. Oct 21 '21 at 08:39
  • Even though there may be more efficient ideas possible, this is short, Pythonic, and easy to understand. +1 – wim Oct 21 '21 at 15:07