5

I'm looking for a text search engine for a non-traditional sort of text search and I want advice on which tool (Lucene, Sphinx, Xapian, or something else) is most appropriate for me, plus pointers on where to get started.

I have molecules represented as graphs (atoms and bonds). I have a way to enumerate all subgraphs of up to size k. To be technical, the inputs are SMILES and the output is a canonical SMARTS for each subgraph, plus the number of times that subgraph/SMARTS occurs.

For example, if the input molecule is "CCO" then the canonical results are {"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1}, and if the molecule is "SCO" then the canonical results are {"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1}. These are tiny examples. For a real molecule I get around 500 "words", which look like "CC(C)O", "CCCOCC", "cn" and "cccc(c)O".
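
In case it helps, here is a rough sketch of the kind of enumeration I mean, using RDKit. My real code is different, and I'm using canonical fragment SMILES here as a stand-in for the canonical SMARTS, so the exact strings won't match the examples above:

    from collections import Counter
    from rdkit import Chem

    def subgraph_words(smiles, max_bonds=3):
        """Count canonical strings for all subgraphs of up to max_bonds bonds."""
        mol = Chem.MolFromSmiles(smiles)
        counts = Counter()
        # Zero-bond subgraphs: the individual atoms.
        for atom in mol.GetAtoms():
            counts[Chem.MolFragmentToSmiles(mol, atomsToUse=[atom.GetIdx()])] += 1
        # Bond-induced subgraphs with 1 .. max_bonds bonds.
        for n in range(1, max_bonds + 1):
            for bond_ids in Chem.FindAllSubgraphsOfLengthN(mol, n):
                atoms = set()
                for b in bond_ids:
                    bond = mol.GetBondWithIdx(b)
                    atoms.update((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))
                counts[Chem.MolFragmentToSmiles(mol, atomsToUse=sorted(atoms),
                                                bondsToUse=list(bond_ids))] += 1
        return counts

    print(subgraph_words("CCO"))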

Looking at molecules as collections of characteristic strings plus counts means I should be able to use a text search tool to do comparisons at the text level, with the hope that they are meaningful at the chemistry level.

For example, I can use cosine similarity, perhaps with tf-idf weighting, and find similar molecules by looking for similar subpatterns. With the "CCO" and "SCO" examples above, the only shared terms are "C", "O" and "OC", so the cosine similarity is (2*1 + 1*1 + 1*1) / (sqrt(2*2 + 1*1 + 1*1 + 1*1 + 1*1) * sqrt(6*(1*1))) = 4/sqrt(8*6) ≈ 0.58.
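
To make the arithmetic concrete, here's a quick Python sketch of that cosine computation over the count vectors (just to show the math, nothing more):

    import math
    from collections import Counter

    def cosine(a, b):
        # Dot product over shared words, divided by the two vector norms.
        dot = sum(count * b[word] for word, count in a.items())
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        return dot / (norm_a * norm_b)

    cco = Counter({"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1})
    sco = Counter({"C": 1, "S": 1, "O": 1, "CS": 1, "OC": 1, "SCO": 1})
    print(cosine(cco, sco))  # 0.577..., i.e. about 0.58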

For another example, if I want to find molecules which contain a "CCS" substructure then I can do a fast inverted-index search based on the counts (a match must have at least 2 "C"s, at least 1 "CS", and so on) before tackling the NP-complete subgraph isomorphism problem. That is, text-based methods can act as a filter to reject obvious mismatches.
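
Here's the sort of prefilter I have in mind, sketched in Python with a simple in-memory inverted index (the real thing would of course live inside the search engine):

    from collections import defaultdict

    # index[word][mol_id] -> how many times `word` occurs in that molecule
    index = defaultdict(dict)

    def add_molecule(mol_id, word_counts):
        for word, count in word_counts.items():
            index[word][mol_id] = count

    def candidates(query_counts):
        # A molecule can only contain the query substructure if it has at
        # least as many of every query word as the query itself does.
        result = None
        for word, needed in query_counts.items():
            hits = {m for m, c in index[word].items() if c >= needed}
            result = hits if result is None else result & hits
            if not result:
                break
        return result or set()

    # Only the surviving candidates go on to the expensive subgraph
    # isomorphism test.
    add_molecule("mol1", {"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1})
    add_molecule("mol2", {"C": 2, "S": 1, "CC": 1, "CS": 1, "CCS": 1})
    print(candidates({"C": 2, "S": 1, "CC": 1, "CS": 1, "CCS": 1}))  # {'mol2'}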

I'm trying to figure out which text solutions exist, but it's a bit daunting. I don't need stop words, I don't need stemming, and I don't care about word order; quite a number of the standard features are irrelevant to me. I do need the ability to keep word vectors with counts, since it's important to know whether "C" appears 2 times or 3.

Which text search engine is most appropriate for me? Lucene looks likely, especially with the work in Mahout. Can you recommend which parts of the documentation to look at, or relevant tutorials? The ones I've found are aimed at full-text search, with stemming and the other features I don't need.

Andrew Dalke
  • What does "similarity" mean to you? E.g. should "C=C" be "similar" to "C-C"? Is "N+" similar to "N"? Is "cco" similar to "c(c)o", etc.? Perhaps if you gave a few example searches and the results they should find, it would help us to know more about what you want (since we are not chemists). – Xodarap Jan 14 '11 at 16:38
  • I have words W_i with repeat counts n_i and i < ~500. I want to do cosine similarity between them, as per the linked definition. I think what I'm looking for is standard in the document search world and the chemistry doesn't matter, but I'll update with an example. – Andrew Dalke Jan 15 '11 at 04:08
  • See also http://stackoverflow.com/questions/2380394/simple-implementation-of-n-gram-tf-idf-and-cosine-similarity-in-python . – Andrew Dalke Jan 17 '11 at 23:09

3 Answers

1

Hmm... I don't really know what SMARTS are, or how chemical similarity actually works. If you want to use Lucene, first consider using Solr. Since your data is in graphs, you could take a look at neo4j with the Solr component. Also, might this problem be more closely related to document near-duplicate detection? There are a number of algorithms for that: LSH, SpotSigs, shingling, and simhash. Wish I could be of more help.

Joyce
  • I want to see if text search can replace or simplify graph search. With 50 million molecules that's about 150 million atoms and as many bonds. I don't see how a generic graph db like neo4j can approach the capabilities of specialized chemistry search engines. But doing a cosine similarity search of 50 million documents each containing at most 1,000 words (all unique) should be easy. I'm looking for a tool for that task. – Andrew Dalke Jan 14 '11 at 15:09
  • Ok, I see what you mean. Well, Solr is pretty easy to use; it is another layer on top of Lucene. Do you know how many fields you may have per chemical? Use the Keyword tokenizer so that input indexed into a field does not get tokenized, and don't add stemming or other special filters to the indexing process. I do recommend the book published by Packt; I think it is perhaps the only book available on enterprise uses of the search engine. – Joyce Jan 14 '11 at 15:40
  • Each compound has about 200-600 "words" selected from a vocabulary of about 200,000 words. Thanks for the book recommendation! – Andrew Dalke Jan 16 '11 at 18:54
1

EDIT: I may have understood this better now. You want to compare graphs, represented as strings. The strings have "words" which may repeat. You may use Lucene, in which case I second the suggestion to use Solr. Basically, each Solr document will consist of a single field; the field will contain the string, which I suggest you unroll: write "C C" instead of "C:2". If you use a space to separate the words, you can use a WhitespaceAnalyzer. If you use another separator, you may need to write a custom analyzer, which is not so hard to do.
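
For instance, the unrolling might look like this in Python (just a sketch of the preprocessing step; how you feed the resulting string to Solr is up to you):

    def unroll(word_counts):
        # "C": 2 becomes "C C", so a whitespace tokenizer sees two tokens.
        return " ".join(" ".join([word] * count)
                        for word, count in sorted(word_counts.items()))

    print(unroll({"C": 2, "O": 1, "CC": 1, "OC": 1, "CCO": 1}))
    # C C CC CCO O OC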

Is this a good idea? I am not sure. Here's why:

  1. Lucene (and Solr) do not use cosine similarity as such, but rather the Lucene Similarity, which mixes cosine, TF/IDF and boolean scoring with some specific modifications. This works well for most textual use cases, but may differ from what you need.
  2. Do you need to compare hits from different searches? If you do, that is hard to do with Solr, as it normalizes every search score to a maximum of 1.

I suggest you try Solr on a small sample of your database. If Solr works for you, fine. If not, shingling and min-hashes are probably the way to go. Mining of Massive Datasets by Rajaraman and Ullman is a recent free book on these subjects, and I suggest you read it; it covers searching for similar strings in mountains of data. I guess the differentiator is: do you need a relatively large intersection? If so, use shingling and min-hashes. If not, maybe Solr is enough.
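
If you do go the min-hash route, the core idea fits in a few lines. Here is a minimal sketch (salting one hash function by slot index to simulate a family of hash functions; a real implementation would add LSH banding on top). Note that plain min-hash works on sets, so your repeat counts are dropped:

    import hashlib

    def minhash_signature(words, num_hashes=64):
        def h(i, w):
            data = ("%d:%s" % (i, w)).encode()
            return int.from_bytes(hashlib.md5(data).digest()[:8], "big")
        # Each slot keeps the minimum hash value over all words.
        return [min(h(i, w) for w in words) for i in range(num_hashes)]

    def estimated_jaccard(sig_a, sig_b):
        # The fraction of matching slots estimates the Jaccard similarity.
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    cco = {"C", "O", "CC", "OC", "CCO"}
    sco = {"C", "S", "O", "CS", "OC", "SCO"}
    print(estimated_jaccard(minhash_signature(cco), minhash_signature(sco)))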

Yuval F
  • String matching and sequence alignment? How so? My "documents" contain "words", which can be repeated. Given a query document and a target document collection, I want to find the 10 nearest in the collection based on (say) cosine similarity. Alignment algorithms imply order, which my data doesn't have. Needleman–Wunsch, Aho–Corasick and other string-matching algorithms just aren't applicable, at least not as far as I can tell. (BTW, I did work in bioinformatics for a bit, so I know some of the places where they can be used.) – Andrew Dalke Jan 16 '11 at 01:44
  • I have edited my answer to better address your documents and words. – Yuval F Jan 16 '11 at 19:47
  • I started reading that book the other day and it's very helpful. I will try with Solr and see what happens. I also came across gensim at http://nlp.fi.muni.cz/projekty/gensim/index.html . – Andrew Dalke Jan 16 '11 at 23:56
0

Don't use Lucene. Or Solr. The internal models are antiquated and cobbled together, although they do a good job. If you want to map this inside a text engine, find one whose minimal criterion is full BM25F support. If I were after scalability, performance and a low-cost support community, frankly I'd go with SQL Server and cubes. Licensing with SQL Server could be a complete blocker, though. Good luck.
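
For reference, plain BM25 scores a single term roughly as below; BM25F extends it by weighting and combining per-field term frequencies before the saturation step. A minimal sketch, using one common IDF variant:

    import math

    def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq,
                        k1=1.2, b=0.75):
        # IDF times a saturating function of term frequency,
        # normalized by document length.
        idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1)
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))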

willemIP
  • I have no idea why BM25F would be appropriate for what I'm doing. Why would it be better than cosine similarity? A friend suggested Xapian, which has BM25 support, but it doesn't appear to be that widely used. I use Macs and other unix variants, so a Windows-only solution won't work. – Andrew Dalke Jan 15 '11 at 04:40