
We use IBM WCS v7 for one of our e-commerce requirements, in which Apache Solr is embedded for the search implementation.

As per a new requirement, there will be multiple-language support for a website; for example, the France version of the site can support English, French, etc. (en_FR, fr_FR, etc.). In order to configure Solr for this, what would be the optimal indexing strategy using a single Solr core?

I have some ideas: 1) using multiple fields in schema.xml for the different languages, or 2) using a different Solr core for each language.

But neither approach seems the best fit for the current requirement, as the e-commerce website will support 18 languages. Using a different field for every language would be very complicated, and using different Solr cores is also not a good approach, since any configuration change would have to be applied to every core.

Are there any other approaches? Or is there any way I can associate the localeId with the indexed data and process the search results with respect to the detected language?
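The single-core idea hinted at here could be sketched as follows. This is only an illustration, not WCS's actual search API: the endpoint, core name, and the `localeId` field are all assumptions, with the field populated on each document at index time and applied as a filter query at search time. (Note the caveat discussed below: within a single field, the analyzer cannot vary per locale.)

```python
from urllib.parse import urlencode

# Hypothetical single-core endpoint; adjust host and core name to your setup.
SOLR_SELECT = "http://localhost:8983/solr/catalog/select"

def build_search_url(query: str, locale_id: str) -> str:
    """Build a Solr select URL restricted to one locale.

    Assumes every indexed document carries a 'localeId' field
    (e.g. 'en_FR', 'fr_FR') populated at index time.
    """
    params = {
        "q": query,
        "fq": f"localeId:{locale_id}",  # filter query: cached per locale by Solr
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)
```

A filter query (`fq`) is used rather than folding the locale into `q`, so Solr can cache the per-locale document set independently of the user's search terms.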

Any help on this topic will be highly appreciated.

Thanks and Regards,

Jitendriya Dash

  • One clarification: Are you intending to localize the data stored in the index, or just support data being stored in different languages? Put another way, are you intending that each document's data will be indexed 18 times, once in each language? Or just once, in whatever language it happens to be in? – femtoRgon Apr 17 '13 at 15:19
  • The data is already there in different languages; we need to index it for the corresponding languages. However, it seems we will go with the one-core-per-language approach, creating a different Solr core for each language supported by the master catalog. With this approach, configuration changes need to be replicated in each core, but the good part is that we don't need to think about language-specific settings (stopwords, protwords, etc. can be handled separately for each language). – dash27 Apr 18 '13 at 06:12
  • Yes, in that case, I think you have the right idea already. Storing multiple languages in the same field will cause problems, which you seem already to have thought through (tokenization, stopwords, etc). Either of the two approaches you state could work well. Another possibility, you could also create separate documents for each language, passing the appropriate analyzer into the addDocument call, and add a field specifying the language of the document. You seem to be on the right track to me though. – femtoRgon Apr 18 '13 at 15:53
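The separate-documents idea from the last comment could look roughly like this. It is a sketch under assumptions: the `language` and `productId` field names and the id scheme are invented for illustration, and each locale variant becomes its own Solr document so it can be analyzed and filtered independently.

```python
# Hypothetical product record already translated into several locales.
translations = {
    "en_FR": {"name": "Leather handbag"},
    "fr_FR": {"name": "Sac a main en cuir"},
}

def to_solr_docs(product_id, translations):
    """Expand one product into one Solr document per locale,
    tagging each variant with the language it is written in."""
    return [
        {
            "id": f"{product_id}_{locale}",  # unique key per locale variant
            "productId": product_id,
            "language": locale,              # filter on this at query time
            **fields,
        }
        for locale, fields in translations.items()
    ]
```

At query time, a filter such as `fq=language:fr_FR` would then restrict results to the visitor's locale.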

1 Answer


This post has already been answered by the original poster and the commenters; summarizing that here as an answer:

The recommended solution is to create one index core per locale/language. This is especially important if the catalog or content (such as product names, descriptions, and keywords) differs per locale and the business prefers to manage each locale separately. It also gives Solr the added benefit of performing stemming and tokenization specific to that locale, where applicable.
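With one core per locale, the search layer only needs to route each request to the right core. A minimal routing sketch, assuming illustrative core names and a standalone Solr at the default port (neither comes from WCS itself):

```python
from urllib.parse import urlencode

SOLR_ROOT = "http://localhost:8983/solr"  # hypothetical Solr host

# One index core per supported locale; names are illustrative only.
CORE_BY_LOCALE = {
    "en_FR": "catalog_en_FR",
    "fr_FR": "catalog_fr_FR",
    # ... one entry for each of the 18 supported locales
}

def build_locale_search_url(query, locale_id):
    """Route a search to the core holding the given locale's catalog."""
    if locale_id not in CORE_BY_LOCALE:
        raise ValueError("unsupported locale: " + locale_id)
    core = CORE_BY_LOCALE[locale_id]
    return f"{SOLR_ROOT}/{core}/select?" + urlencode({"q": query, "wt": "json"})
```

Each core can then carry its own analyzer chain (stopwords, protwords, stemmer) in its schema, which is the main advantage this answer describes.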

I have been part of solutions where this approach was preferred over maintaining multiple fields or multiple documents in the same core for each locale/language. The largest number of index cores I have worked with is 6.

One must also remember that adding an index core requires updates to the supporting processes: Product Information Management (PIM) updates, catalog load, workspace management, stage propagation, reindexing, and cache invalidation.

AnbuP