
I have Documents that I'd like to make searchable in 3 different languages. Since I can have multiple fields with the same name/type, the following Document structure works (this is a simplified example).

from google.appengine.api import search

document = search.Document(
    fields=[
      search.TextField(
        name="name",
        language="en",
        value="dog"),
      search.TextField(
        name="name",
        language="es",
        value="perro"),
      search.TextField(
        name="name",
        language="fr",
        value="chien")
    ]
)
index = search.Index("my_index")
index.put(document)

Specifying the language helps Google tokenize the value of the TextField.

The following queries all work, each returning one result:

print index.search("name: dog")
print index.search("name: perro")
print index.search("name: chien")

Here is my question: Can I restrict a search to only target fields with a specific language?

The purpose is to avoid false positives. Since all three languages use the Latin alphabet, someone performing a full-text search in Spanish may see irrelevant English results.

Thank you.

Aaron Drenberg
  • calling google translate api for language detection and using result in query: `get_index(lang-detected).search(query)` or translating search term to stored data language and searching based on translation result – Hadi Farhadi Jul 01 '17 at 06:14

2 Answers

You can use facets to attach metadata to a document that does not appear among its fields. Here, the facets indicate which languages appear in the document.

Document insertion:

    index = search.Index("my_index")
    document = search.Document(
        fields=[
          search.TextField(
            name="name",
            language="en",
            value="dog"),
          search.TextField(
            name="name",
            language="es",
            value="perro"),
          search.TextField(
            name="name",
            language="fr",
            value="chien")
        ],
        facets=[
           search.AtomFacet(name='lang', value='en'),
           search.AtomFacet(name='lang', value='es'),
           search.AtomFacet(name='lang', value='fr'),
        ],
      )
    index.put(document)
    document = search.Document(
        fields=[
          search.TextField(
            name="name",
            language="es",
            value="gato"),
          search.TextField(
            name="name",
            language="fr",
            value="chat")
        ],
        facets=[
           # No English in this document, so leave out lang='en'
           search.AtomFacet(name='lang', value='es'),
           search.AtomFacet(name='lang', value='fr'),
        ],
      )
    index.put(document)

Query:

index = search.Index("my_index")
query = search.Query(
    '', # query all documents, cats and dogs.
    # filter docs by language facet
    facet_refinements=[
        search.FacetRefinement('lang', value='en'),
    ])

results = index.search(query)
for doc in results:
    result = {}
    for f in doc.fields:
        # filter fields by language
        if f.language == 'en':
            result[f.name] = f.value
    print result

This should print `{u'name': u'dog'}`.

Note that although we can fetch only the documents that contain English, we still have to filter out the fields in other languages within those documents. This is why we iterate through the fields, adding only the English ones to result.

If you want to know more about the more general use case for faceted search, this answer gives a pretty good idea.

Frank Wilson
  • Turns out you need to use FacetRefinements instead of FacetRequests. The former selects documents by facet; the latter only tells you which facets are available. – Frank Wilson Jun 24 '17 at 09:56
  • document = search.Document( doc_id=str("1"), fields=[ search.TextField(language="en", name="name", value="one"), search.TextField(language="es", name="name", value="uno") ]) index.put(document) document = search.Document( doc_id=str("2"), fields=[ search.TextField(language="en", name="name", value="uno"), search.TextField(language="es", name="name", value="one") ]) index.put(document) index.search(search.Query( "name: one", facet_refinements=[ search.FacetRefinement("lang", value="en") ])) – Aaron Drenberg Jun 26 '17 at 23:42
  • The above code is quite gross, but SO won't let me format it in the comment. It's a case where FacetRefinement returns zero results, despite having a match. Do you know why? – Aaron Drenberg Jun 26 '17 at 23:42
  • @user326502 that's because you didn't add a `facets` parameter (or accompanying `AtomFacet`s to your document). – Frank Wilson Jun 27 '17 at 18:30
  • When I add `AtomFacets` to my document, then query with a `FacetRefinement` of `lang="en"`, both documents are returned. Which isn't really what I'm looking for. I'm trying to filter out the documents where the field has a match, but the language does not. – Aaron Drenberg Jun 28 '17 at 15:20
  • To clarify, I'm trying to search only the English fields, and none of the other ones. – Aaron Drenberg Jun 28 '17 at 15:21
  • @user326502 the idea is to put only AtomFacet(name='lang', value='en'), in documents where you have `TextField` with language='en'. – Frank Wilson Jun 28 '17 at 15:47
  • @user326502 I extended my example to provide clarification – Frank Wilson Jun 28 '17 at 16:02
  • Thanks. Each document will have all 3 of the same languages. I suppose I could add separate documents for each distinct language. That changes how I planned on assigning `doc_id` but that's not a big deal. – Aaron Drenberg Jul 01 '17 at 16:09
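
The one-document-per-language idea from the last comment could be sketched roughly as follows. The helper names (`language_doc_id`, `split_by_language`) and the `<id>-<lang>` ID scheme are hypothetical and not part of the Search API. The sketch builds plain dicts so that it runs without the App Engine SDK; in real code each dict would feed `search.Document`, `search.TextField` and `search.AtomFacet`.

```python
def language_doc_id(base_id, lang):
    """Derive a per-language document ID such as '42-en' (hypothetical scheme)."""
    return "{}-{}".format(base_id, lang)


def split_by_language(base_id, translations):
    """Turn {'en': 'dog', 'es': 'perro'} into one document spec per language.

    Each spec is a plain dict. In App Engine code the values would be passed to
    search.Document(doc_id=..., fields=[search.TextField(...)],
                    facets=[search.AtomFacet(name='lang', value=...)]).
    """
    specs = []
    for lang in sorted(translations):
        specs.append({
            "doc_id": language_doc_id(base_id, lang),
            "field": {"name": "name", "language": lang,
                      "value": translations[lang]},
            "facet": {"name": "lang", "value": lang},
        })
    return specs


specs = split_by_language("42", {"en": "dog", "es": "perro", "fr": "chien"})
```

With one document per language, a `lang` facet refinement would select only documents whose single `name` field is in that language, and the ID prefix (here `42`) leads back to the original record.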

You could use a separate index for each language.

Define a utility function for resolving the correct index for a given language:

def get_index(lang):
    return search.Index("my_index_{}".format(lang))

Insert documents:

document = search.Document(
    fields=[
      search.TextField(
        name="name",
        language="en",
        value="dog"),
    ])

get_index('en').put(document)

document = search.Document(
    fields=[
      search.TextField(
        name="name",
        language="fr",
        value="chien")
    ])

get_index('fr').put(document)

Query by language:

query = search.Query(
    'name: chien')

results = get_index('fr').search(query)

for doc in results:
    print doc
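
Because the index name is built from the language code, a mistyped code would, as far as I know, just address a different (empty) index rather than raise an error. A minimal sketch of guarding against that is below. `SUPPORTED_LANGUAGES` and `index_name_for` are hypothetical names, and the whitelist assumes the three languages from the question.

```python
# Assumption: the three languages mentioned in the question.
SUPPORTED_LANGUAGES = ("en", "es", "fr")


def index_name_for(lang, prefix="my_index"):
    """Return the per-language index name, rejecting unknown language codes."""
    code = lang.strip().lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError("unsupported language: {!r}".format(lang))
    return "{}_{}".format(prefix, code)


# In App Engine code this would be used as:
#     search.Index(index_name_for("fr"))
name = index_name_for("fr")
```

`get_index` above could then call `index_name_for(lang)` instead of formatting the name directly.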

Frank Wilson
  • I took a similar approach by using separate fields for each language, and then appending the language code to the search field name. That's my fallback approach, but I'm hoping to find a better solution here. – Aaron Drenberg Jul 01 '17 at 16:06
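
The fallback described in this comment (one field per language, with the language code appended to the field name) might look something like the sketch below. The `name_en` naming convention and both helper names are hypothetical. The sketch only builds keyword arguments and query strings, so it runs without the App Engine SDK, and it assumes the search term is a single, already sanitized word.

```python
def localized_field_name(base, lang):
    """Build a per-language field name such as 'name_es' (hypothetical convention)."""
    return "{}_{}".format(base, lang)


def localized_fields(base, translations):
    """Build keyword arguments for one text field per language.

    In App Engine code each dict would be passed as search.TextField(**kwargs).
    """
    return [
        {"name": localized_field_name(base, lang),
         "language": lang,
         "value": translations[lang]}
        for lang in sorted(translations)
    ]


def localized_query(base, lang, term):
    """Build a query string that targets only one language's field."""
    return "{}: {}".format(localized_field_name(base, lang), term)


fields = localized_fields("name", {"en": "dog", "es": "perro", "fr": "chien"})
query = localized_query("name", "es", "perro")
```

Here `query` is the string `name_es: perro`, which would be passed to `index.search(...)` in the same way as the `name: perro` query in the question, but can only match the Spanish field.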