
In my webMethods application I need to implement search functionality, which I have done with Lucene. However, the search does not return results when I search for a file whose title ends in something other than a letter, e.g. doc1.txt or new$.txt.

In the code below, when I print queryCmbd for such a query, the output is Search Results>>>>>>>title:"doc1 txt" (contents:doc1 contents:txt). When I search for a string like doc.txt, the output is Search Results>>>>>>>title:"doc.txt" contents:doc.txt. What should be done in order to parse these kinds of strings (like doc1.txt and new$.txt)?

 public java.util.ArrayList<DocNames> searchIndex(String querystr,
                String path, StandardAnalyzer analyzer) {
            String FIELD_CONTENTS = "contents";
            String FIELD_TITLE = "title";

            // AND the terms together, then search the title as a phrase OR the contents
            String queryFinal = querystr.replaceAll(" ", " AND ");
            String queryStringCmbd = FIELD_TITLE + ":\"" + queryFinal + "\" OR "
                    + queryFinal;


            try {

                FSDirectory directory = FSDirectory.open(new File(path));

                Query q = new QueryParser(Version.LUCENE_36, FIELD_CONTENTS,
                        analyzer).parse(querystr);

                Query queryCmbd = new QueryParser(Version.LUCENE_36,
                        FIELD_CONTENTS, analyzer).parse(queryStringCmbd);

                int hitsPerPage = 10;
                IndexReader indexReader = IndexReader.open(directory);
                IndexSearcher indexSearcher = new IndexSearcher(indexReader);

                TopScoreDocCollector collector = TopScoreDocCollector.create(
                        hitsPerPage, true);
                indexSearcher.search(queryCmbd, collector);
                ScoreDoc[] hits = collector.topDocs().scoreDocs;

                System.out
                        .println("Search Results>>>>>>>>>>>>"
                                + queryCmbd);

                docNames = new ArrayList<DocNames>();
                for (int i = 0; i < hits.length; ++i) {
                    int docId = hits[i].doc;
                    Document d = indexSearcher.doc(docId);
                    DocNames doc = new DocNames();
                    doc.setIndex(i + 1);
                    doc.setDocName(d.get("title"));
                    doc.setDocPath(d.get("path"));
                    if (!(d.get("path").contains("indexDirectory"))) {
                        docNames.add(doc);
                    }
                }

                indexReader.flush();
                indexReader.close();
                indexSearcher.close();
                return docNames;
            } catch (CorruptIndexException e) {
                closeIndex(analyzer);
                e.printStackTrace();
                return null;
            } catch (IOException e) {
                closeIndex(analyzer);
                e.printStackTrace();
                return null;
            } catch (ParseException e) {
                closeIndex(analyzer);
                e.printStackTrace();
                return null;
            }
        }
Cheese

1 Answer


Your problem comes from the fact that you're using StandardAnalyzer. Its javadoc says that it uses StandardTokenizer for token splitting, which means a phrase like doc1.txt will be split into doc1 and txt.

If you want to match the entire text, you need to use KeywordAnalyzer, both for indexing and for searching. The code below shows the difference: with StandardAnalyzer the tokens are {"doc1", "txt"}, while with KeywordAnalyzer the only token is doc1.txt.

    // Note: tokenStream()/incrementToken() throw IOException; TermAttribute is
    // the Lucene 3.x accessor for the token text (CharTermAttribute in later versions).
    String foo = "foo:doc1.txt";

    // StandardAnalyzer splits on punctuation, so doc1.txt becomes two tokens
    StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_34);
    TokenStream tokenStream = sa.tokenStream("foo", new StringReader(foo));
    while (tokenStream.incrementToken()) {
        System.out.println(tokenStream.getAttribute(TermAttribute.class).term());
    }

    System.out.println("-------------");

    // KeywordAnalyzer emits the entire input as a single token
    KeywordAnalyzer ka = new KeywordAnalyzer();
    TokenStream tokenStream2 = ka.tokenStream("foo", new StringReader(foo));
    while (tokenStream2.incrementToken()) {
        System.out.println(tokenStream2.getAttribute(TermAttribute.class).term());
    }
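Since the question indexes both a contents field and a title field, one way to combine the two analyzers (a sketch only, assuming Lucene 3.x and the field names from the question) is PerFieldAnalyzerWrapper: keep StandardAnalyzer as the default for contents, but tokenize title with KeywordAnalyzer. The same wrapper instance must be used both when building the index and when parsing queries, otherwise the terms won't match.

```java
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

// Default analyzer for all fields is StandardAnalyzer;
// "title" is overridden to KeywordAnalyzer so the whole value is one token.
PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_36));
analyzer.addAnalyzer("title", new KeywordAnalyzer());

// Use the SAME wrapper at index time (when adding documents) and at query time:
Query q = new QueryParser(Version.LUCENE_36, "contents", analyzer)
        .parse("title:\"doc1.txt\" OR doc1.txt");
```

With this setup, title:"doc1.txt" is kept as the single term doc1.txt, while the unqualified part of the query is still tokenized by StandardAnalyzer for the contents field.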
mindas
  • 26,463
  • 15
  • 97
  • 154
  • Thanks, that really worked! But can we use KeywordAnalyzer to find files whose content is like 'X X X X X X X X X X X X', i.e. when the search string contains spaces? I tried but failed to get results. What can be done in such cases? Kindly help – Cheese Jan 21 '13 at 09:33
  • You need to provide more details, and the best thing to do is to open a new question and show the code. Also, if this answer helped, you might want to accept it, otherwise people will stop providing you answers. – mindas Jan 21 '13 at 10:41