I'd like to implement a searchable index using Lucene.Net 4.8 that supplies a user with suggestions / autocomplete for single words & phrases.

The index has been created successfully; the suggestions are where I've stalled.

Version 4.8 seems to have introduced a substantial number of breaking changes, and none of the available samples I've found work.

Where I stand

For reference, LuceneVersion is this:

private readonly LuceneVersion LuceneVersion = LuceneVersion.LUCENE_48;

Solution 1

I've tried this, but can't get past reader.Terms:

    public void TryAutoComplete()
    {
        var analyzer = new EnglishAnalyzer(LuceneVersion);
        var config = new IndexWriterConfig(LuceneVersion, analyzer);
        RAMDirectory dir = new RAMDirectory();
        using (IndexWriter iw = new IndexWriter(dir, config))
        {
            Document d = new Document();
            TextField f = new TextField("text","",Field.Store.YES);
            d.Add(f);
            f.SetStringValue("abc");
            iw.AddDocument(d);
            f.SetStringValue("colorado");
            iw.AddDocument(d);
            f.SetStringValue("coloring book");
            iw.AddDocument(d);
            iw.Commit();
            using (IndexReader reader = iw.GetReader(false))
            {
                TermEnum terms = reader.Terms(new Term("text", "co"));
                int maxSuggestsCpt = 0;
                // will print:
                // colorado
                // coloring book
                do
                {
                    Console.WriteLine(terms.Term.Text);
                    maxSuggestsCpt++;
                    if (maxSuggestsCpt >= 5)
                        break;
                }
                while (terms.Next() && terms.Term.Text.StartsWith("co"));
            }
        }
    }

reader.Terms no longer exists. Being new to Lucene, it's unclear how to refactor this.

Solution 2

Trying this, I get an error:

    public void TryAutoComplete2()
    {
        using(var analyzer = new EnglishAnalyzer(LuceneVersion))
        {
            IndexWriterConfig config = new IndexWriterConfig(LuceneVersion, analyzer);
            RAMDirectory dir = new RAMDirectory();
            using(var iw = new IndexWriter(dir,config))
            {
                Document d = new Document()
                {
                    new TextField("text", "this is a document with a some words",Field.Store.YES),
                    new Int32Field("id", 42, Field.Store.YES)
                };

                iw.AddDocument(d);
                iw.Commit();

                using (IndexReader reader = iw.GetReader(false))
                using (SpellChecker speller = new SpellChecker(new RAMDirectory()))
                {
                    //ERROR HERE!!!
                    speller.IndexDictionary(new LuceneDictionary(reader, "text"), config, false);
                    string[] suggestions = speller.SuggestSimilar("dcument", 5);
                    IndexSearcher searcher = new IndexSearcher(reader);
                    foreach (string suggestion in suggestions)
                    {
                        TopDocs docs = searcher.Search(new TermQuery(new Term("text", suggestion)), null, Int32.MaxValue);
                        foreach (var doc in docs.ScoreDocs)
                        {
                            System.Diagnostics.Debug.WriteLine(searcher.Doc(doc.Doc).Get("id"));
                        }
                    }
                }
            }
        }
    }

When debugging, the call speller.IndexDictionary(new LuceneDictionary(reader, "text"), config, false); throws a "The object cannot be set twice!" error, which I can't explain.

Any thoughts are welcome.

Clarification

I'd like to return a list of suggested terms for a given input, not the documents or their full content.

For example, if a document contains "Hello, my name is Clark. I'm from Atlanta," and I submit "Atl," then "Atlanta" should come back as a suggestion.

user

1 Answer

If I understand you correctly, you may be over-complicating your index design a bit. If your goal is to use Lucene for autocomplete, create an index of the terms you consider complete. Then query that index with a PrefixQuery built from a partial word or phrase.

using Lucene.Net.Analysis;
using Lucene.Net.Analysis.En;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;
using System;
using System.Linq;

namespace LuceneDemoApp
{
    class LuceneAutoCompleteIndex : IDisposable
    {
        const LuceneVersion Version = LuceneVersion.LUCENE_48;
        RAMDirectory Directory;
        Analyzer Analyzer;
        // An IndexWriterConfig can be attached to only one IndexWriter; reusing it
        // throws "The object cannot be set twice!", so build a fresh one on each access.
        IndexWriterConfig WriterConfig =>
            new IndexWriterConfig(Version, Analyzer) { OpenMode = OpenMode.CREATE_OR_APPEND };

        private void IndexDoc(IndexWriter writer, string term)
        {
            Document doc = new Document();
            // StringField is indexed as a single, unanalyzed token.
            doc.Add(new StringField(FieldName, term, Field.Store.YES));
            writer.AddDocument(doc);
        }

        public LuceneAutoCompleteIndex(string fieldName, int maxResults)
        {
            FieldName = fieldName;
            MaxResults = maxResults;
            Directory = new RAMDirectory();
            Analyzer = new EnglishAnalyzer(Version);
        }
        }

        public string FieldName { get; }
        public int MaxResults { get; set; }

        public void Add(string term)
        {
            using (var writer = new IndexWriter(Directory, WriterConfig))
            {
                IndexDoc(writer, term);
            }
        }

        public void AddRange(string[] terms)
        {
            using (var writer = new IndexWriter(Directory, WriterConfig))
            {
                foreach (string term in terms)
                {
                    IndexDoc(writer, term);
                }
            }
        }

        public string[] WhereStartsWith(string term)
        {
            using (var reader = DirectoryReader.Open(Directory))
            {
                IndexSearcher searcher = new IndexSearcher(reader);
                var query = new PrefixQuery(new Term(FieldName, term));
                TopDocs foundDocs = searcher.Search(query, MaxResults);
                var matches = foundDocs.ScoreDocs
                    .Select(scoreDoc => searcher.Doc(scoreDoc.Doc).Get(FieldName))
                    .ToArray();

                return matches;
            }
        }

        public void Dispose()
        {
            Directory.Dispose();
            Analyzer.Dispose();
        }
    }
}

Running this:

var indexValues = new string[] { "apple fruit", "appricot", "ape", "avacado", "banana", "pear" };
var index = new LuceneAutoCompleteIndex("fn", 10);
index.AddRange(indexValues);

var matches = index.WhereStartsWith("app");
foreach (var match in matches)
{
    Console.WriteLine(match);
}

You get this:

apple fruit
appricot
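
A side note on the "The object cannot be set twice!" error from Solution 2 in the question. As far as I can tell, an IndexWriterConfig instance in 4.8 can be attached to only one IndexWriter, and SpellChecker.IndexDictionary opens its own writer with whatever config it is given. Passing the same config that was already handed to iw would trip that check. Below is a minimal sketch of the workaround, assuming the Lucene.Net.Suggest package (for SpellChecker) is referenced. The class and method names (SpellIndexSketch, BuildSpellIndex) are made up for illustration, and StandardAnalyzer is just one possible choice.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Search.Spell;
using Lucene.Net.Store;
using Lucene.Net.Util;

static class SpellIndexSketch
{
    const LuceneVersion Version = LuceneVersion.LUCENE_48;

    // Hypothetical helper: builds a spell index from the terms of one field.
    public static SpellChecker BuildSpellIndex(IndexReader reader, string field)
    {
        var speller = new SpellChecker(new RAMDirectory());

        // Use a fresh config here; do NOT pass the one already given to an IndexWriter.
        var spellConfig = new IndexWriterConfig(Version, new StandardAnalyzer(Version));

        speller.IndexDictionary(new LuceneDictionary(reader, field), spellConfig, false);
        return speller;
    }
}
```

Keep in mind that SuggestSimilar does spelling correction ("dcument" to "document") rather than prefix completion, so it complements the PrefixQuery approach above rather than replacing it.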
Timothy Jannace
  • This worked. Appreciate the help. Now it's a matter of getting my long, document-based suggestions to be useful to humans... – user Mar 05 '20 at 21:05
  • Awesome, glad that worked for you. You may know this already, but you can add multiple fields to a single document. One field can be the full document and another can be used as a display name or other tag like that. You could search based on one field and display another. – Timothy Jannace Mar 05 '20 at 21:36
  • Yeah, that I've got down ;). It's rather that the suggestions I'm getting are paragraphs. Ex. With a body like "Hi, my name is Tom Smith. This is a long paragraph about my life," searching Tom nets me the whole paragraph in some cases. Anyway, not a huge problem, it's more parsing the results. Open to suggestions, of course. – user Mar 05 '20 at 21:45
  • Looks like this'll help: https://stackoverflow.com/questions/34529261/lucene-wildcard-query-with-space – user Mar 05 '20 at 21:46
  • So your example would return just the word "Tom"? Or "Tom Smith. This is a long paragraph about my life." Or something else? – Timothy Jannace Mar 05 '20 at 21:59
  • "Tom Smith. This is a long paragraph..." - that's the behavior I'm seeing at the moment. – user Mar 05 '20 at 22:04
  • I'm also finding that this isn't working with stemmed words. I'm also using the EnglishAnalyzer, so entering in "estimate" returns no results, while "estim" does. Any idea how to account for that? – user Mar 06 '20 at 01:26
  • Sounds like we need to try a different analyzer. The english analyzer does do some things with word stems. – Timothy Jannace Mar 06 '20 at 01:42
  • Yeah, figured as much. Appreciate it. – user Mar 06 '20 at 01:44
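
For the term-level suggestions discussed in these comments (getting "Atlanta" back rather than a whole paragraph), one option is to enumerate the indexed terms of a field directly. This is roughly what reader.Terms did before 4.8. The sketch below is an assumption-laden starting point, not a tested drop-in: SuggestTerms and TermSuggestSketch are made-up names, and the enumerator calls changed between 4.8 betas, as noted in the code comments. It only returns whole words if the field was indexed with a tokenizing, non-stemming analyzer (StandardAnalyzer, for example). With EnglishAnalyzer the terms are stems such as "estim".

```csharp
using System.Collections.Generic;
using Lucene.Net.Index;
using Lucene.Net.Util;

static class TermSuggestSketch
{
    // Hypothetical helper: returns up to `max` indexed terms of `field` that start with `prefix`.
    public static List<string> SuggestTerms(IndexReader reader, string field, string prefix, int max)
    {
        var results = new List<string>();
        if (max <= 0) return results;

        Terms terms = MultiFields.GetTerms(reader, field);
        if (terms == null) return results; // the field has no indexed terms

        var prefixBytes = new BytesRef(prefix);
        TermsEnum termsEnum = terms.GetEnumerator(); // older 4.8 betas: terms.GetIterator(null)

        // Terms are stored in sorted order, so seek to the first term >= prefix.
        if (termsEnum.SeekCeil(prefixBytes) == TermsEnum.SeekStatus.END) return results;

        do
        {
            BytesRef current = termsEnum.Term;
            // Stop as soon as the terms no longer share the prefix.
            if (!StringHelper.StartsWith(current, prefixBytes)) break;
            results.Add(current.Utf8ToString());
        }
        while (results.Count < max && termsEnum.MoveNext()); // older betas: termsEnum.Next() != null

        return results;
    }
}
```

It can be called with a reader from DirectoryReader.Open(directory), for example SuggestTerms(reader, "text", "atl", 5). Most analyzers lowercase terms at index time, so lowercase the user's input before passing it as the prefix.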