
I need to find the relevancy between given two queries in terms of distance for instance :

Q1(Query1) = Computing

Q2(Query2) = RAM

Let's assume the relevancy path is something like this:

Computing->Personal Computer->Computer Hardware->Computer Components->Random Access Memory->RAM

The result should be 5 (the number of edges on the path).

Now the problem is that most of these graph databases, like Freebase, do not support that feature. The only way is to recursively compare one query with the other.

Question: Is there an easy and quick way to do this, or are there any graph databases that support this feature?

Note: this is not an algorithm question. I know this can easily be achieved with DFS or BFS in theory, but in practice a node (entry) could have 1000 edges, and I don't want to traverse them all.
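For reference, the BFS hop count mentioned above is only a few lines of Python. The adjacency list here is a toy stand-in for a real graph database, mirroring the path from the question; this is exactly the traversal that becomes expensive when nodes have thousands of edges:

```python
from collections import deque

def hop_distance(graph, start, goal):
    """Unweighted shortest-path length (number of edges) via BFS."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # no path found

# Toy directed graph mirroring the relevancy path in the question
graph = {
    "Computing": ["Personal Computer"],
    "Personal Computer": ["Computer Hardware"],
    "Computer Hardware": ["Computer Components"],
    "Computer Components": ["Random Access Memory"],
    "Random Access Memory": ["RAM"],
}
print(hop_distance(graph, "Computing", "RAM"))  # 5
```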

Sarp Kaya

2 Answers


Firstly, to find the distance between two nodes in a graph, simply count the number of edges on the path from one node to the other, and then apply some sort of quantification to turn that count into a distance. It is possible to put Freebase entities into a graph and then count the number of edges between two nodes.

Let's start with a simpler resource, WordNet. For simplicity, let's use NLTK's WordNet API (http://www.nltk.org/). You can quantify path similarity (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html) like so:

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('computing')
[Synset('computer_science.n.01'), Synset('calculation.n.01'), Synset('calculate.v.01')]
>>> wn.synsets('ram')
[Synset('random-access_memory.n.01'), Synset('aries.n.01'), Synset('aries.n.03'), Synset('ram.n.04'), Synset('ram.n.05'), Synset('ram.v.01'), Synset('force.v.06'), Synset('crash.v.03'), Synset('jam.v.06')]
>>> computing = wn.synsets('computing')[0]
>>> ram = wn.synsets('ram')[0]
>>> computing.path_similarity(ram)
0.058823529411764705
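NLTK's path_similarity is defined as 1 / (shortest_path_distance + 1) over the hypernym/hyponym taxonomy, so the score above can be converted back into an edge count. A minimal sketch, reusing the number from the session above (the conversion assumes a path was actually found; path_similarity returns None otherwise):

```python
sim = 0.058823529411764705  # computing.path_similarity(ram) from the session above

# path_similarity = 1 / (distance + 1), so invert and subtract 1;
# round() guards against floating-point noise in the reciprocal
distance = round(1 / sim) - 1
print(distance)  # 16 edges between the two synsets in WordNet's taxonomy
```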

Do note that using WordNet might cause more problems than it solves ;P.

alvas
  • I had a look at Freebase. The first thing is that I don't have enough storage and processing power to use their entire database (which is about 250GB). Using their API isn't a good idea either, as some entries might have 1000 outgoing edges. So, if I assume that each of those entries has at least 100 outgoing edges, I would need to process 1000*100 = 100,000 entries for a tree of depth 3. I am aware of WordNet, but I would prefer an API that does the computation for me. – Sarp Kaya Mar 10 '14 at 09:25

What you are looking for is the shortest path between two nodes in a graph. If your graph is unweighted, this comes down to a breadth-first search.

I assume your relevance edges are weighted, so you need an algorithm like Dijkstra's. Since you are most likely interested in the all-pairs shortest path problem (you might want to query the shortest path between any two nodes later on), you could use the Floyd-Warshall algorithm. This creates a matrix with the shortest-path values between all pairs of nodes.

If you import your graph into Neo4j, you can use their Floyd-Warshall implementation. Be aware that this algorithm is cubic in the number of vertices, so it really depends on your use case which of the above algorithms to choose.
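For comparison, here is a plain-Python sketch of the Floyd-Warshall idea on a hypothetical small weighted graph (nodes numbered 0..n-1, edges as a dict); the triple loop is where the cubic cost comes from:

```python
INF = float("inf")

def floyd_warshall(n, edges):
    """All-pairs shortest paths. edges maps (u, v) -> weight for a directed graph."""
    # dist[i][j] starts as the direct edge weight (or INF if absent)
    dist = [[0.0 if i == j else INF for j in range(n)] for i in range(n)]
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
    # Relax every pair through every possible intermediate node k: O(n^3)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Tiny directed example: the two-hop path 0->1->2 beats the direct edge 0->2
dist = floyd_warshall(3, {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 5.0})
print(dist[0][2])  # 3.0
```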

I have attached some sample code (for Neo4j 1.4) which shows how easy it is:

    // Edge weights are stored in a "Cost" property; take -log so that
    // multiplying edge weights becomes adding path costs.
    CostEvaluator<Double> costEvaluator = new CostEvaluator<Double>() {
        public Double getCost(Relationship relationship, Direction direction) {
            return -Math.log((Double) relationship.getProperty(Messages
                    .getString("RelProperty.Cost")));
        }
    };

    // The cost of a path is the sum of its edge costs.
    CostAccumulator<Double> costAccumulator = new CostAccumulator<Double>() {
        public Double addCosts(Double c1, Double c2) {
            return c1 + c2;
        }
    };

    // Natural ordering on Double: a smaller cost is a shorter path.
    Comparator<Double> costComparator = new Comparator<Double>() {
        public int compare(Double o1, Double o2) {
            return o1.compareTo(o2);
        }
    };

    FloydWarshall<Double> fw = new FloydWarshall<Double>(1.0,
            Double.MIN_VALUE, Direction.OUTGOING, costEvaluator,
            costAccumulator, costComparator, this.currencySet,
            this.tradeSet);

    fw.reset();
    fw.calculate();

You can then query your graph for any two nodes node1 and node2 with:

fw.getCost(node1, node2)

By the way, a breadth-first search matching exactly your query can also easily be implemented in Neo4j, using either the traverser framework or a simple Cypher query.

Rene Pickhardt
  • Thanks for that, but I was not asking for an efficient algorithm to do this. I was wondering what lexical analysis tools I can use for my needs. – Sarp Kaya Mar 10 '14 at 09:21