
I'm running a search application on a FAST ESP server, and I have a problem with character normalization.

What I want is to search for 'wurth' and get a hit on 'würth'.

I've tried configuring the following in esp/etc/tokenizer/tokenization.xml:

<normalizationlist name="German to Norwegian">
  <normalization description="German u with diaeresis, to Norwegian u">
    <input>x75</input>
    <output>xFC</output>
    <output>x75</output>
  </normalization>
</normalizationlist>

But of course, this translates every u to ü, which is useless.

How do I configure this the right way?

jorgen

3 Answers


The solution is to normalize every "special character" to the same "normal character":

ö -> o
ø -> o
å -> a
ä -> a
æ -> a

This is a bit time-consuming, but it works!
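
For example, building on the tokenization.xml snippet from the question, a rule folding ü to plain u could look something like the sketch below. This is only an illustration: the element names and the hex code points for ü (xFC) and u (x75) are taken from the question itself, the list name and descriptions are placeholders, and whether several <normalization> entries can share one list may depend on your ESP version, so check the product documentation before relying on it.

<normalizationlist name="Special character folding">
  <!-- map u with diaeresis (xFC) to plain u (x75) -->
  <normalization description="u with diaeresis to plain u">
    <input>xFC</input>
    <output>x75</output>
  </normalization>
  <!-- map o with diaeresis (xF6) to plain o (x6F) -->
  <normalization description="o with diaeresis to plain o">
    <input>xF6</input>
    <output>x6F</output>
  </normalization>
</normalizationlist>

Note that the direction is the reverse of the snippet in the question: the accented character goes in <input> and the plain character in <output>, so würth and wurth normalize to the same token.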

jorgen

Read the Advanced Linguistics Guide. It contains a chapter on Character Normalization. If you follow the steps in the guide, all special characters will be treated as their plain equivalents, so searching for über will give the same results as searching for uber.


You can also install the custom dictionaries available from MS support, which provide a dictionary for each language. If you install German, the search engine will understand what you are trying to search for via the "did you mean" feature, which you can enable for search queries once the dictionary is installed. Also, don't forget to set up the search schema with the proper character encoding for multi-language support: if the documents in the collection are not indexed with the proper character encoding, any effort you make at the tokenization and query ends is useless.

Saul Rosales