How to deal with identifier fields in Lucene?

Question

I've stumbled upon a problem similar to the one described in this other question: I have a field named something like 'type', which is an identifier, i.e., it's case-sensitive and I want to use it for exact searches only: no tokenisation, no similarity searches, just plain "find exactly 'Sport:01'". I might benefit from prefix searches such as 'Sport*', but that's not very important in my case.

I cannot make it work. I thought the right kind of field for this was StringField.TYPE_STORED, with DOCS_AND_FREQS_AND_POSITIONS and setOmitNorms ( true ). However, this way I can't correctly resolve a query like +type:"RockMusic" +title:"a sample title" using the standard analyzer because, as far as I understand, the analyzer converts the query input to lower case (i.e., rockmusic), while the type is indexed in its original mixed-case form. Hence the query finds nothing, even if I remove the title clause.
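To make the mismatch concrete, here is a minimal sketch that prints the terms each analyzer produces for the same value. It assumes Lucene 8.x with lucene-core and lucene-analyzers-common on the classpath; the class name AnalyzerTokens and the helper tokens() are illustrative only and do not come from the linked code.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Hypothetical helper class, for illustration only.
class AnalyzerTokens {

    /** Returns the terms the given analyzer emits for a field value. */
    static List<String> tokens(Analyzer analyzer, String field, String text) throws IOException {
        List<String> result = new ArrayList<>();
        try (TokenStream ts = analyzer.tokenStream(field, text)) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();                      // required before the first incrementToken()
            while (ts.incrementToken()) {
                result.add(term.toString());
            }
            ts.end();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // The standard analyzer lower-cases, so the term no longer equals the indexed value.
        System.out.println(tokens(new StandardAnalyzer(), "type", "RockMusic")); // [rockmusic]
        // The keyword analyzer emits the whole input unchanged as a single term.
        System.out.println(tokens(new KeywordAnalyzer(), "type", "RockMusic"));  // [RockMusic]
    }
}
```

Since a StringField is indexed verbatim as RockMusic, only the second form can match it.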

I'd like to mix case-insensitive search over title with case-sensitive search over type, since I have cases where type = BRAIN is an acronym and is different from 'Brain'.

So, what's the best way to manage fields and searches like the above? Are there alternatives other than text and string fields?

I'm using Lucene 6.6.0, but this is a general issue, regarding multiple (all?) Lucene versions.

Some code showing details is here (see testIdMixedCaseID*). The real use case is rather more complicated; if you want to take a look, the problem is with the field CC_FIELD, which might be 'BioProc', and nothing can be found in such a case.

Please note I need to use plain Lucene, not Solr or Elasticsearch.

Can you add the relevant parts of your code to the question? What version of Lucene are you using? — andrewJames, Jun 01 '20 at 00:23
@andrewjames, I've added a few details, though the question is general hence they aren't very relevant. Thanks. — zakmck, Jun 01 '20 at 08:57

score 2 · Answer 1 · answered Jun 03 '20 at 00:08

The following notes are based on Lucene 8.x, not on Lucene 6.6, so there may be some syntax differences, but I take your point that any such differences should be incidental to your question.

Here are some notes, where I will focus on the following aspect of your question:

However, this way I can't correctly resolve a query like: +type:"RockMusic" +title:"a sample title" using the standard analyzer

I think there are 2 parts to this:

Firstly, the query example using "a sample title" will - as you say - not work well with how a standard analyzer works - for the reasons you state.

But, secondly, it is possible to combine the two types of query you want to use, in a way which I believe gets you what you need: An exact match for the type field (e.g. RockMusic) and a more traditional tokenized & case-insensitive result for the title field (a sample title).

Here is how I would do that:

Here is some simple test data:

public static void buildIndex() throws IOException {
    final Directory dir = FSDirectory.open(Paths.get(INDEX_PATH));
    Analyzer analyzer = new StandardAnalyzer();
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE);
    Document doc;

    try (IndexWriter writer = new IndexWriter(dir, iwc)) {
        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "RockMusic", Field.Store.YES));
        doc.add(new TextField("title", "another different title", Field.Store.YES));
        writer.addDocument(doc);

        doc = new Document();
        doc.add(new StringField("type", "Rock Music", Field.Store.YES));
        doc.add(new TextField("title", "a sample title", Field.Store.YES));
        writer.addDocument(doc);

    }
}

Here is the query code:

public static void doSearch() throws ParseException, IOException {

    IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
    IndexSearcher searcher = new IndexSearcher(reader);

    TermQuery typeQuery = new TermQuery(new Term("type", "RockMusic"));

    Analyzer analyzer = new StandardAnalyzer();
    QueryParser parser = new QueryParser("title", analyzer);
    Query titleQuery = parser.parse("A Sample Title");

    Query query = new BooleanQuery.Builder()
            .add(typeQuery, BooleanClause.Occur.MUST)
            .add(titleQuery, BooleanClause.Occur.MUST)
            .build();

    System.out.println("Query: " + query.toString());
    System.out.println();

    TopDocs results = searcher.search(query, 100);
    ScoreDoc[] hits = results.scoreDocs;
    for (ScoreDoc hit : hits) {
        System.out.println("doc = " + hit.doc + "; score = " + hit.score);
        Document doc = searcher.doc(hit.doc);
        System.out.println("Type = " + doc.get("type")
                + "; Title = " + doc.get("title"));
        System.out.println();
    }
}

The output from the above query is as follows:

Query: +type:RockMusic +(title:a title:sample title:title)

doc = 0; score = 0.7016101
Type = RockMusic; Title = a sample title

doc = 1; score = 0.2743341
Type = RockMusic; Title = another different title

As you can see, this query is a little different from the one taken from your question.

But the list of found documents shows that (a) the Rock Music document was not found at all (good - because Rock Music does not match the "type" search term of RockMusic); and (b) the title a sample title got a far higher match score than the another different title document, when searching for A Sample Title.

Additional notes:

This query works by combining a StringField exact search with a more traditional TextField tokenized search - this latter search being processed by the StandardAnalyzer (matching how the data was indexed in the first place).

I am making an assumption about the score ranking being useful to you - but for title searches, I think that is reasonable.

This approach would also apply to your BRAIN vs. brain example, for StringField data.

(I also assume that, for a user interface, a user could select the "RockMusic" type value from a drop-down, and enter the "A Sample Title" search in an input field - but this is getting off-topic, I think).

You could obviously enhance the analyzer to include stop-words, and so on, as needed.

Of course, my examples involve hard-coded data - but it would not take much to generalize this approach to handle dynamically-provided search terms.
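As a rough sketch of that generalization, the helper below takes the type and the title text as parameters. It assumes Lucene 8.x (lucene-core and lucene-queryparser on the classpath) and uses an in-memory ByteBuffersDirectory purely to keep the example self-contained; the class and method names are hypothetical. It also illustrates the BRAIN vs. Brain point from the notes above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical helper class, for illustration only.
class TypedTitleSearch {

    /** Indexes {type, title} pairs: type verbatim (StringField), title analyzed (TextField). */
    static Directory index(Analyzer analyzer, String[][] rows) throws IOException {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String[] row : rows) {
                Document doc = new Document();
                doc.add(new StringField("type", row[0], Field.Store.YES));
                doc.add(new TextField("title", row[1], Field.Store.YES));
                writer.addDocument(doc);
            }
        }
        return dir;
    }

    /** Returns titles of documents whose type equals 'type' exactly and whose title matches 'titleText'. */
    static List<String> search(Directory dir, Analyzer analyzer, String type, String titleText)
            throws IOException, ParseException {
        Query typeQuery = new TermQuery(new Term("type", type));
        // escape() makes the parser treat user input as plain text rather than query syntax.
        Query titleQuery = new QueryParser("title", analyzer).parse(QueryParser.escape(titleText));
        Query query = new BooleanQuery.Builder()
                .add(typeQuery, BooleanClause.Occur.MUST)
                .add(titleQuery, BooleanClause.Occur.MUST)
                .build();

        List<String> titles = new ArrayList<>();
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            for (ScoreDoc hit : searcher.search(query, 100).scoreDocs) {
                titles.add(searcher.doc(hit.doc).get("title"));
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        Directory dir = index(analyzer, new String[][] {
                {"BRAIN", "brain imaging consortium"},
                {"Brain", "brain anatomy"}});
        System.out.println(search(dir, analyzer, "BRAIN", "Brain")); // [brain imaging consortium]
        System.out.println(search(dir, analyzer, "brain", "Brain")); // []
    }
}
```

The title text is case-insensitive because it goes through the analyzer, while the type must match character for character because the TermQuery is compared against the indexed term as-is.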

Hope that this makes sense - and that I understood the problem correctly.

@andrewjames thanks so much for the deep analysis. I've ended up with similar code, with the addition of a PerFieldAnalyzerWrapper, which is able to select KeywordAnalyzer for fields like "type" and uses the standard analyzer as the default. I had to struggle to realise that applying the indexing analyzer at search time requires a query parser, but now it's working; I will post my answer later. Thanks again! — zakmck, Jun 03 '20 at 09:29
I had not noticed `PerFieldAnalyzerWrapper` - that is a very useful class for this situation. — andrewJames, Jun 04 '20 at 13:35

zakmck · Accepted Answer · 2020-06-17T15:48:24.143

Going to answer myself...

I discovered what @andrewjames outlines in his excellent analysis by running a number of tests of my own. Essentially, fields like "type" don't play well with the standard analyzer; they are best indexed and searched with an analyzer like KeywordAnalyzer which, in practice, keeps the original value as-is, as a single term, and searches it accordingly.

Most real cases are like my example, i.e., a mix of ID-like fields, which need exact matching, and fields like 'title' or 'description', which best serve user searches through per-token searching, word-based scoring, stop-word elimination, etc.

Because of that, PerFieldAnalyzerWrapper (see also my sample code, linked above) is very helpful: it is a wrapper analyzer that dispatches analysis to field-specific analyzers, based on the field name.
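A minimal sketch of such a wrapper follows. It is not the linked sample code: it assumes Lucene 8.x with lucene-analyzers-common and lucene-queryparser on the classpath, and the class and method names are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Hypothetical demo class, for illustration only.
class PerFieldQueryDemo {

    /** KeywordAnalyzer for the ID-like "type" field, StandardAnalyzer for every other field. */
    static Analyzer perFieldAnalyzer() {
        Map<String, Analyzer> perField = new HashMap<>();
        perField.put("type", new KeywordAnalyzer());
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }

    /** Parses a query string; "title" is the default field for clauses without a field prefix. */
    static Query parse(String queryText) throws ParseException {
        return new QueryParser("title", perFieldAnalyzer()).parse(queryText);
    }

    public static void main(String[] args) throws ParseException {
        // The type clause keeps its case; the title phrase is lower-cased by the standard analyzer.
        System.out.println(parse("+type:RockMusic +title:\"A Sample Title\""));
    }
}
```

The same wrapper can be passed to IndexWriterConfig, so that indexing and searching agree on how each field is analyzed. Two caveats with the parser route: values containing query-syntax characters need escaping (for example type:Sport\:01, or QueryParser.escape on the value), and values containing spaces need to be quoted so they reach the keyword analyzer as one piece.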

One thing to add: it still isn't clear to me which analyzer is used when a query is built without a parser (e.g., using new TermQuery ( new Term ( fname, fval ) )), so I now use a QueryParser.
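On that open point: as far as I can tell, a TermQuery built by hand goes through no analyzer at all, and the term text is compared with the indexed terms exactly as given. The sketch below (assuming Lucene 8.x; the class and helper names are hypothetical) checks this by counting hits for differently cased terms.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical demo class, for illustration only.
class TermQueryNoAnalysis {

    /** Builds a one-document in-memory index using the standard analyzer. */
    static Directory buildIndex() throws IOException {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("type", "RockMusic", Field.Store.YES));     // indexed verbatim
            doc.add(new TextField("title", "A Sample Title", Field.Store.YES)); // tokenized, lower-cased
            writer.addDocument(doc);
        }
        return dir;
    }

    /** Counts documents matching a hand-built TermQuery; no analyzer is involved here. */
    static int count(Directory dir, String field, String value) throws IOException {
        try (IndexReader reader = DirectoryReader.open(dir)) {
            return new IndexSearcher(reader).count(new TermQuery(new Term(field, value)));
        }
    }

    public static void main(String[] args) throws IOException {
        Directory dir = buildIndex();
        System.out.println(count(dir, "type", "RockMusic")); // 1
        System.out.println(count(dir, "type", "rockmusic")); // 0
        System.out.println(count(dir, "title", "Sample"));   // 0, the index holds "sample"
        System.out.println(count(dir, "title", "sample"));   // 1
    }
}
```

So a TermQuery suits StringField identifiers, where the indexed term is the raw value, while analyzed fields are more safely queried through a parser configured with the indexing analyzer.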
