
I've just implemented a full-text search engine built on Hibernate Search.

I'm looking for a solution to one issue. I have texts with Polish (UTF-8) characters, such as "zażółć gęślą jaźń". When I search for "jaźń", everything is OK and the result is found. But when I search for "jazn", nothing is found.

I would like any of the variants "jaźń", "jazń", "jaźn", and "jazn" to match the text "zażółć gęślą jaźń". How can I configure Hibernate Search to do so?

Piotr Pradzynski

1 Answer


You have to define an analyzer to analyze your text before indexing/querying.

See section 1.8 of the Hibernate Search documentation for an introduction to analyzers, and section 4.3 for more complete information on analysis.

To fix your issue, the analyzer you define has to include the ASCIIFoldingFilter, which transforms non-ASCII characters to their nearest ASCII equivalent (and probably the LowerCaseFilter too). See this example.
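As a rough sketch of what such an analyzer definition can look like, assuming the annotation-based mapping of Hibernate Search 5.x (the version isn't stated in the question). The entity name `Article`, the field `content`, and the analyzer name `folding` are made up for illustration:

```java
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;

import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.hibernate.search.annotations.Analyzer;
import org.hibernate.search.annotations.AnalyzerDef;
import org.hibernate.search.annotations.Field;
import org.hibernate.search.annotations.Indexed;
import org.hibernate.search.annotations.TokenFilterDef;
import org.hibernate.search.annotations.TokenizerDef;

// Hypothetical entity; only the analyzer wiring matters here.
@Entity
@Indexed
@AnalyzerDef(
    name = "folding", // assumed name, pick your own
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        // Lowercase first, then fold accented characters to ASCII
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class)
    }
)
public class Article {

    @Id
    @GeneratedValue
    private Long id;

    // Indexed with the "folding" analyzer, so "jaźń" is stored as "jazn"
    @Field
    @Analyzer(definition = "folding")
    private String content;
}
```

You need to reindex existing data after changing the analyzer, since the terms already in the index were produced by the old one.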

If you are using the Hibernate Search DSL to build your queries, the same analyzer is applied to your search terms automatically. If you build your queries with stock Lucene, you have an example here, which binds the analyzer automatically to the fields.

Note that wildcard queries are not analyzed by default, so if you use wildcards, you'll need to clean up your string before passing it to the query.

You can see an example of how to sanitize your queries for wildcards here.

It uses ASCIIFoldingFilter underneath with this sort of code.
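If you'd rather not pull Lucene classes into that cleanup step, here is a minimal JDK-only sketch of the same idea. It is not the linked code and not ASCIIFoldingFilter itself, just an approximation that is good enough for Polish; the class and method names are hypothetical. Note that "ł" has no Unicode decomposition, so it has to be mapped by hand:

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper: folds a raw search term the way the index-time
// filters would, leaving wildcard characters (* and ?) untouched.
final class TermFolding {

    static String fold(String input) {
        // NFD splits e.g. "ź" into "z" + a combining acute accent
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Drop all combining marks (accents, ogonek, dot above, ...)
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // "ł" (U+0142) and "Ł" (U+0141) do not decompose, so map them explicitly
        return stripped
                .replace('\u0142', 'l')
                .replace('\u0141', 'L')
                .toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(fold("jaźń*"));             // jazn*
        System.out.println(fold("Zażółć gęślą jaźń")); // zazolc gesla jazn
    }
}
```

ASCIIFoldingFilter covers many more characters than this (for example "ß" or "æ"), so if your texts are not only Polish, folding through the filter itself is the safer route.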

femtoRgon
Guillaume Smet
  • Thanks. But I think this only affects indexing, so I will have "zazolc gesla jazn" in the index. Will a search for "jaźń" still work then, or do I need to preprocess the search term somehow? – Piotr Pradzynski May 31 '16 at 10:08
  • I completed the answer. You should have all the pointers now. – Guillaume Smet May 31 '16 at 15:11
  • Thanks! I'm using [Querydsl for Hibernate Search](https://github.com/querydsl/querydsl/tree/master/querydsl-hibernate-search) and I do not see how to use the ASCIIFoldingFilter there, so I probably need to prepare the search term myself before sending it to Querydsl, right? – Piotr Pradzynski Jun 01 '16 at 08:57
  • Thanks to @DavidS's answer on [this question](http://stackoverflow.com/questions/3322152/is-there-a-way-to-get-rid-of-accents-and-convert-a-whole-string-to-regular-lette) I used Apache Commons [StringUtils.stripAccents(input)](https://commons.apache.org/proper/commons-lang/javadocs/api-3.1/org/apache/commons/lang3/StringUtils.html#stripAccents(java.lang.String)) to prepare the term. Now everything looks good. Thanks for the help! – Piotr Pradzynski Jun 01 '16 at 09:39