Synonym chains - Efficient routing algorithm for iOS/sqlite

Question

A synonym chain is a series of closely related words that span two anchors. For example, the English words "black" and "white" can connected as:

black-dark-obscure-hidden-concealed-snug-comfortable-easy-simple-pure-white

Or, here's "true" and "false":

true-just=fair=beautiful=pretty-artful-artificial-sham-false

I'm working on a thesaurus iOS app, and I would like to display synonym chains also. The goal is to return a chain from within a weighted graph of word relations. My source is a very large thesaurus with weighted data, where the weights measure similarity between words. (e.g., "outlaw" is closely related to "bandit", but more distantly related to "rogue.") Our actual values range from 0.001 to ~50, but you can assume any weight range.

What optimization strategies do you recommend to make this realistic, e.g., within 5 seconds of processing on a typical iOS device? Assume the thesaurus has half a million terms, each with 20 associations. I'm sure there's a ton of prior research on these kinds of problems, and I'd appreciate pointers on what might be applied to this.

My current algorithm involves recursively descending a few levels from the start and end words, and then looking for intercepting words, but that becomes too slow with thousands of sqlite (or Realm) selects.

Is your goal to find a synonym chain (which appears to be a single *path* between two given vertices) or a MST? — j_random_hacker, Jun 30 '15 at 12:24
Good question. I used the term wrong, it's not a MST. We might have half a million nodes, but only 6 in the chain. — some ideas, Jun 30 '15 at 13:11
OK, then this paper on approximate shortest paths might be useful: https://people.mpi-inf.mpg.de/~sseufert/papers/aspsn-cikm.pdf — j_random_hacker, Jun 30 '15 at 13:31
Thanks! And the sample sizes of test data sets (millions of vertices) is the right order of magnitude for me. — some ideas, Jun 30 '15 at 15:05

score 2 · Answer 1 · answered Jun 29 '15 at 23:25

Since you said your source is a large thesaurus with weighted data, I'm assuming if you pick any word, you will have the weight to its successor in the similarity graph. I will always use the sequence below, when I'm giving any example:

black-dark-obscure-hidden-concealed-snug-comfortable-easy-simple-pure-white

Let's think of the words as being a node on a graph, each relationship of similarity a word has with another is a path on that graph. Each path is weighted with a cost, which is the weight you have on the source file. So the best solution to find a path from one word to another is to use the A* (A star) path finding.

I'm using the minimum "cost" to travel from a word to its successor to be 1. You can adjust it accordingly. First you will need a good heuristic function to use, since this is a greedy algorithm. This heuristic function will return the "greedy" distance between two words, any words. You must respect the fact the the "distance" it returns can never be bigger than the real distance between the two words. Since I don't know any relationship between any words for a thesaurus, my heuristic function will always return the minimum cost 1. In other words, it will always say a word is the most similar word to any other. For example, my heuristic function tells me that 'black' is the best synonym for 'white'.

You must tune the heuristic function if you can, so it will respond with more accurate distances making the algorithm runs faster. That's the tricky part I guess.

You can see the pseudo-code for the algorithm on the Wikipedia article I sent. But here it is for a faster explanation:

function A*(start,goal)
    closedset := the empty set    -- The set of nodes already evaluated.
    openset := {start}    -- The set of tentative nodes to be evaluated, initially containing the start node
    came_from := the empty map    -- The map of navigated nodes.

    g_score[start] := 0    -- Cost from start along best known path.
    -- Estimated total cost from start to goal through y.
    f_score[start] := g_score[start] + heuristic_cost_estimate(start, goal)

    while openset is not empty
        current := the node in openset having the lowest f_score[] value
        if current = goal
            return reconstruct_path(came_from, goal)

        remove current from openset
        add current to closedset
        for each neighbor in neighbor_nodes(current)
            if neighbor in closedset
                continue
            tentative_g_score := g_score[current] + dist_between(current,neighbor)

            if neighbor not in openset or tentative_g_score < g_score[neighbor] 
                came_from[neighbor] := current
                g_score[neighbor] := tentative_g_score
                f_score[neighbor] := g_score[neighbor] + heuristic_cost_estimate(neighbor, goal)
                if neighbor not in openset
                    add neighbor to openset

    return failure

function reconstruct_path(came_from,current)
    total_path := [current]
    while current in came_from:
        current := came_from[current]
        total_path.append(current)
    return total_path

Now, for the algorithm you'll have 2 arrays of nodes, the ones you are going to visit (opened) and the ones you already visited (closed). You will also have two arrays of distances for each node, that you will be completing as you travel through the graph.

One array (g_score) will tell you the real lowest traveled distance between the starting node and the specified node. For example, g_score["hidden"] will return the lowest weighted cost to travel from 'black' to 'hidden'.

The other array (f_score) will tell you the supposed distance between the node you specified to the goal you want to reach. For example, f_score["snug"] will return the supposed weighted cost to travel from "snug" to "white" using the heuristic function. Remember, this cost will always be less or equal the real cost to travel between words, since our heuristic function need to respect the aforementioned rule.

As the algorithm runs, you will be traveling from node to node, from the starting word, saving all the nodes you traveled and the costs you "used" to travel. You will be replacing the traveled path when you find a better cost to travel on the g_score array. You will use the f_score to predict which node will be best visited first, from the array of 'unvisited' nodes. It's best if you save your f_score as a minimum Heap.

You will end the algorithm when you find the node that is the goal that you want. Then you will reconstruct the minimum path using the array of nodes visited that you kept saving at each iteration. Another way the algorithm will stop is if it visited all neighbor nodes and didn't find the goal. When this happens, you can say there is no path from the starting node to the goal.

This algorithm is the most used on games to find the better path between two objects on a 3D world. To improve it, you just need to create a better heuristic function, that can let the algorithm find the better nodes to travel first, leding it to the goal faster.

-- 7f

score 1 · Answer 2 · answered Mar 09 '21 at 19:04

1

Here's a closely related question and answer: Algorithm to find multiple short paths

There you can see comments about Dijkstra's and A-star, Dinic's, but more broadly also the idea of maximum flow and minimum cost flow.

answered Mar 09 '21 at 19:04

Michael Douma

1,144
8
21

Synonym chains - Efficient routing algorithm for iOS/sqlite

2 Answers2