
I have indexed some Wikipedia URLs along with the contents of those pages. I would like to search both fields at once, or only one of them, with an option to allow wildcards on either field (although I doubt I would ever use wildcards on the content input), so that I can find not only

https://en.wikipedia.org/wiki/Polish_songs_(Chopin)

but I can also find

https://en.wikipedia.org/wiki/Fr%C3%A9d%C3%A9ric_Chopin

I am using Lucene 7.1.0 and could not fully grasp the documentation, so I searched SO thoroughly and based my code on the following Q&As (mostly for older Lucene versions, but the functionality seems to be the same):

Exact Phrase search using Lucene?

How to do a Multi field - Phrase search in Lucene?

Phrase query in Lucene 6.2.0

How to match exact text in Lucene search?

Based on the answers above, I came up with some ideas on how to conduct a search using the following input:

Assumptions:

Query q;
QueryParser qp;

/* I think this is fishy but do not know how to do it differently */
if (contentOnlySearch) {
  qp = new QueryParser("content", analyzer);
} else {
  qp = new QueryParser("url", analyzer);
}
/* end */

String urlQuery = "chopin";
String contentQuery = "polish songs that chopin wrote";
if (allowUrlQueryWildcards) {
  urlQuery = "*" + urlQuery + "*";
}
if (allowContentQueryWildcards) {
  contentQuery = "*" + contentQuery + "*";
}

Idea 1:

q = MultiFieldQueryParser.parse(
  new String[] {urlQuery, contentQuery},
  new String[] {"url", "content"},
  analyzer
);

Idea 2:

q = qp.parse("url:\"" + urlQuery + "\" AND content:\"" + contentQuery + "\"");
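To make the intent of Idea 2 concrete, here is a minimal sketch of how the query string could be assembled, assuming the classic query parser syntax. The class `FieldQueryBuilder` and its methods are hypothetical helpers, not part of Lucene. The sketch leaves the URL wildcard term unquoted, because the classic parser does not treat `*` inside a quoted phrase as a wildcard, and it assumes the wildcard term is a single word with no special characters. A leading `*` is also rejected by default unless `setAllowLeadingWildcard(true)` is called on the parser.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene class): builds a classic-syntax query string. */
final class FieldQueryBuilder {

    /** Wraps a single term in wildcards, e.g. chopin -> url:*chopin* (unquoted on purpose). */
    static String wildcardTerm(String field, String term) {
        return field + ":*" + term + "*";
    }

    /** Quotes the text as a phrase, escaping embedded backslashes and double quotes. */
    static String phrase(String field, String text) {
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return field + ":\"" + escaped + "\"";
    }

    /** Joins the non-empty clauses with AND; returns "" when no input was supplied. */
    static String build(String urlQuery, boolean urlWildcards, String contentQuery) {
        List<String> clauses = new ArrayList<>();
        if (urlQuery != null && !urlQuery.trim().isEmpty()) {
            String term = urlQuery.trim();
            clauses.add(urlWildcards ? wildcardTerm("url", term) : phrase("url", term));
        }
        if (contentQuery != null && !contentQuery.trim().isEmpty()) {
            clauses.add(phrase("content", contentQuery.trim()));
        }
        return String.join(" AND ", clauses);
    }
}
```

With the inputs from the assumptions above, `FieldQueryBuilder.build("chopin", true, "polish songs that chopin wrote")` yields `url:*chopin* AND content:"polish songs that chopin wrote"`, and omitting one input simply drops that clause, which covers the single-field case.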

1. Both of these approaches, along with some others, failed miserably: I either got an error or, most often, found a page that does not contain the input contentQuery. What am I doing wrong?

Side questions:

2. If I have around 10,000 URLs saved, is allowing wildcards a good choice here, or should I tweak the indexing/searching process so that entering "Chopin" finds both of these URLs?

3. Why does Lucene find and index "Chopin" in "(Chopin)", but not in "_Chopin"?
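A rough sketch of what may be behind question 3, assuming the `analyzer` is a `StandardAnalyzer`: its tokenizer follows the Unicode word-break rules, under which an underscore joins adjacent letters and digits instead of separating them, whereas parentheses and `%` do split. The regex below is only an approximation of that behaviour written with the JDK alone, and the class name `UnderscoreTokenSketch` is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical approximation of underscore handling; not Lucene's actual tokenizer. */
final class UnderscoreTokenSketch {

    // Approximation: a run of letters, digits, and underscores forms one token.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    /** Splits the text into lowercased tokens, keeping underscores inside tokens. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            out.add(m.group().toLowerCase(Locale.ROOT));
        }
        return out;
    }
}
```

Under this approximation, `tokens("Polish_songs_(Chopin)")` gives `[polish_songs_, chopin]`, so "chopin" is a term of its own, while `tokens("Fr%C3%A9d%C3%A9ric_Chopin")` gives `[fr, c3, a9d, c3, a9ric_chopin]`, where "chopin" only exists as the tail of a longer term and can be reached only with a leading wildcard.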

PS. I come from JavaScript, I hope my Java code is correct and does not cause your eyes to bleed.

James Pond