
I'm wondering what the best practices (or people's experiences) are for multilingual indexing and search in Elasticsearch. I read through a number of resources and, as best I can distill it, the available options for indexing are:

  1. separate index per language;

  2. multi field type for multilingual field;

  3. separate field for all the possible languages.

So I'm wondering what the side effects are of choosing one or another of these options (or some other option that I've missed). I guess having more indices does not really slow down the cluster (unless the number of languages is huge), so I'm not sure what I would gain from choosing 2 or 3, except perhaps easier maintenance.

Any help welcomed!

ilijaluve
  • In my use case only one language was mandatory, so I used one index, had analyzers for each known language, put the _analyzer on the path of the language, and kept the language-relevant content in multi-fields: once analyzed by language and once with the default for "no language". – cfrick Mar 03 '14 at 18:02

4 Answers


This is a somewhat old question, but the info might be helpful anyway. The index/mapping structure mainly depends on your use case.
Do you need to search all the languages simultaneously, or is only one language used at a time?

  • Option 1: A multilingual website, for example, where users only see and search in the language they have currently chosen. In this case my experience is that index-per-language is a good solution, especially if you need to be able to add and remove languages easily. The data is split between the indices (a performance benefit). Analyzers are easy to set up for each language, especially if their settings differ only by the language name. Personally, I'm currently using this option for one of my projects.
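To illustrate the "settings differ only by the language name" point, here is a minimal Python sketch that generates one create-index body per language. It is only an illustration under assumptions: the `books` prefix, the `book_title` field, and the language-code table are hypothetical, the analyzer names are Elasticsearch's built-in language analyzers, and the typeless `mappings` layout assumes a 7.x or later cluster.

```python
# Hypothetical table: language code -> built-in Elasticsearch analyzer name.
LANGUAGE_ANALYZERS = {"en": "english", "de": "german", "it": "italian"}


def index_name(prefix: str, lang: str) -> str:
    """Return the per-language index name, e.g. 'books_en'."""
    return f"{prefix}_{lang}"


def index_body(lang: str) -> dict:
    """Build a create-index body; only the analyzer changes per language."""
    return {
        "mappings": {
            "properties": {
                "book_title": {
                    "type": "text",
                    "analyzer": LANGUAGE_ANALYZERS[lang],
                }
            }
        }
    }


# One (index name, body) pair per language; each body would be sent as PUT /<name>.
INDICES = {index_name("books", lang): index_body(lang) for lang in LANGUAGE_ANALYZERS}
```

Adding a language then means adding one entry to the table and creating one more index, and removing a language means deleting its index.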

General notes for options 2 and 3: Either of these options lets you score documents differently based on the language, as you can define scoring for each language field. You can add new fields to a mapping if you need to add more languages, but there is no way to remove or change existing fields. Hence, to drop a language you will have to reindex all your content and set the property for the removed language to empty. You will also need to add a new analyzer for every new language, which requires closing the index first and reopening it after the changes are made.
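The close / change / reopen sequence for adding an analyzer can be sketched as a list of REST calls. This is only a sketch that builds the requests without sending them: the helper name, the `books` index, and the `my_french` analyzer are hypothetical, while `_close`, `_settings`, and `_open` are the standard index endpoints.

```python
def add_analyzer_requests(index: str, name: str, analyzer: dict) -> list:
    """Return the (method, path, body) steps for adding an analyzer to an existing index."""
    settings = {"analysis": {"analyzer": {name: analyzer}}}
    return [
        # Analysis settings cannot be changed on an open index, so close it first.
        ("POST", f"/{index}/_close", None),
        # Register the new analyzer in the index settings.
        ("PUT", f"/{index}/_settings", settings),
        # Reopen the index so it can be searched and written to again.
        ("POST", f"/{index}/_open", None),
    ]


# Hypothetical example: a French analyzer added to an index called "books".
STEPS = add_analyzer_requests("books", "my_french", {"type": "french"})
```

The index is unavailable between the first and last step, which is worth planning for on a live site.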

  • Option 2: If you need to search in all languages at once the multi-field gives you the easiest access as you can address all its sub-fields at once:

    "book_title": {
        "type": "multi_field",
        "fields": {
            "english": {
                "type": "string"
            },
            "german": {
                "type": "string"
            },
            "italian": {
                "type": "string"
            }
        }
    }

Here you can search in a specific language (e.g. "book_title.english") or in all languages (using "book_title"). Be careful not to update the field using the "book_title" name; use "book_title.[language]" instead. Using "book_title" will update all the sub-fields with identical data, which is probably not what you want.
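For readers on newer versions: `multi_field` was replaced by the `fields` parameter in Elasticsearch 1.0, and `string` by `text`/`keyword` in 5.0. A rough present-day equivalent of the mapping above might look like the Python sketch below. The helper name is hypothetical and the per-language analyzers are an assumption added for illustration (they are built-in analyzer names). Note that with `fields`, every sub-field indexes the same source value, just analyzed differently.

```python
def multilingual_text_field(analyzers: dict) -> dict:
    """Build a `text` field with one sub-field per language analyzer."""
    return {
        "type": "text",
        "fields": {
            # Each sub-field re-indexes the same value with its own analyzer.
            sub_field: {"type": "text", "analyzer": analyzer}
            for sub_field, analyzer in analyzers.items()
        },
    }


# Assumed sub-field names, mirroring the mapping in the answer.
MAPPING = {
    "properties": {
        "book_title": multilingual_text_field(
            {"english": "english", "german": "german", "italian": "italian"}
        )
    }
}
```

The sub-fields are still addressed as "book_title.english" and so on in queries.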

  • Option 3: Completely separate fields. You will need to list them all in the search query if you want to search across languages as in option 2. It is safer in terms of indexing, as you cannot overwrite all the languages by mistake.

  • Idea for option 4: use a type per language. This can be used if you have only one type of document, and it lets you have different fields per language. It is not useful if you have multiple document types.
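For option 3, searching all languages means naming every language field in the query. A minimal sketch of such a query body, using `multi_match` with per-field boosts to show the "score differently per language" idea from the general notes. The field names (`book_title_en`, etc.) and the boost values are hypothetical; the `field^boost` syntax is standard.

```python
def all_languages_query(text: str, boosts: dict) -> dict:
    """Build a multi_match query over separate per-language fields."""
    # "field^2" means matches in that field count twice as much.
    fields = [f"{field}^{boost}" for field, boost in boosts.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Hypothetical field names and boosts: English matches weigh double.
QUERY = all_languages_query(
    "faust",
    {"book_title_en": 2, "book_title_de": 1, "book_title_it": 1},
)
```

Every new language has to be added to the boosts table, which is the maintenance cost of this option compared with option 2.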

Shote
  • I was thinking about option 4 before I came to this post. My scenario contains multiple eshops, each with one or more languages. The only searchable document type is the product. I was considering the following index/type/document structure: `.../eshopName/language/product[]`. Do you think this could be a standard way to handle multilingual eshops with just product search? I must be able, however, to search either per language or across all languages, which I should be able to get with `/eshop/en,de,fr/product` – ulkas May 27 '15 at 12:50
  • Option 4 shouldn't be used because it messes up term frequencies, as stated here: https://www.elastic.co/guide/en/elasticsearch/guide/current/one-lang-docs.html – Lumbendil Nov 05 '15 at 17:10
  • How would one set the values for each of those fields? – IamIC Jan 30 '18 at 17:12
  • Just reading the conversation: when updating a document, Elasticsearch internally deletes it and then creates a new document. The Option 2 warning is therefore not right imo. – Artholl Apr 18 '19 at 08:56

In case other people are looking for answers, here's a direct link to the documentation on the Elasticsearch site: https://www.elastic.co/guide/en/elasticsearch/guide/current/mixed-lang-fields.html

blockcipher

I would go with option 1 (separate index per language) as suggested by the Elasticsearch documentation since it makes sure you avoid term-frequency issues.

If your document contains multiple languages, you can put it in multiple indices and use field collapsing at query time to avoid duplicates of the same document being returned.
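A minimal sketch of what that could look like, assuming each copy of a document carries the same value in a shared ID field. The `doc_id` field, the index names, and the helper names are hypothetical; `collapse` is the search-body parameter for field collapsing and needs a keyword or numeric field.

```python
def search_path(indices: list) -> str:
    """Build the path for a search across several indices at once."""
    return "/" + ",".join(indices) + "/_search"


def collapsed_search(text: str, field: str, collapse_on: str) -> dict:
    """Build a search body that returns one hit per distinct `collapse_on` value."""
    return {
        "query": {"match": {field: text}},
        # Keep only the best-scoring copy of each document across the indices.
        "collapse": {"field": collapse_on},
    }


# Hypothetical usage: search the English and German indices together.
PATH = search_path(["books_en", "books_de"])
BODY = collapsed_search("faust", "book_title", "doc_id")
```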

Philip

I think it all depends on the use case. Option 1 won't be optimal if we have multiple fields with mixed languages (locales), as there would be a lot of redundant data for the non-localizable fields. Option 2 may be better in that case.

W. Itte