
Can someone explain why using ngram as a tokenizer gives a different output compared to using it as a filter? For example, using it as a tokenizer for "Paracetamol" I get:

{
   "tokens": [
      {
         "token": "par",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "para",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "parac",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "parace",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paracet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paraceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paracetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paracetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paracetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "ara",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "arac",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "arace",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aracet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "araceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aracetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aracetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aracetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "rac",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "race",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "racet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "raceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "racetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "racetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "racetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "ace",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "acet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "acetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "acetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "acetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "cet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "ceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "cetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "cetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "cetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "eta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "etam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "etamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "etamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "tam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "tamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "tamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "amo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "amol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "mol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      }
   ]
}
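For reference, a tokenizer-based setup that could produce output like the above might look as follows. This is a sketch, not the asker's actual mapping: the index and analyzer names are assumptions, and `min_gram: 3` / `max_gram: 11` are inferred from the shortest and longest grams in the token list.

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 11
        }
      },
      "analyzer": {
        "ngram_tokenizer_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Here the ngram tokenizer itself slices the raw input into grams, so there is only one analysis "unit" (the whole input) from which all grams are emitted.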

Whereas using it as a filter I get:

{
   "tokens": [
      {
         "token": "par",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 1
      },
      {
         "token": "para",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 2
      },
      {
         "token": "parac",
         "start_offset": 0,
         "end_offset": 5,
         "type": "word",
         "position": 3
      },
      {
         "token": "parace",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 4
      },
      {
         "token": "paracet",
         "start_offset": 0,
         "end_offset": 7,
         "type": "word",
         "position": 5
      },
      {
         "token": "paraceta",
         "start_offset": 0,
         "end_offset": 8,
         "type": "word",
         "position": 6
      },
      {
         "token": "paracetam",
         "start_offset": 0,
         "end_offset": 9,
         "type": "word",
         "position": 7
      },
      {
         "token": "paracetamo",
         "start_offset": 0,
         "end_offset": 10,
         "type": "word",
         "position": 8
      },
      {
         "token": "paracetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 9
      },
      {
         "token": "ara",
         "start_offset": 1,
         "end_offset": 4,
         "type": "word",
         "position": 10
      },
      {
         "token": "arac",
         "start_offset": 1,
         "end_offset": 5,
         "type": "word",
         "position": 11
      },
      {
         "token": "arace",
         "start_offset": 1,
         "end_offset": 6,
         "type": "word",
         "position": 12
      },
      {
         "token": "aracet",
         "start_offset": 1,
         "end_offset": 7,
         "type": "word",
         "position": 13
      },
      {
         "token": "araceta",
         "start_offset": 1,
         "end_offset": 8,
         "type": "word",
         "position": 14
      },
      {
         "token": "aracetam",
         "start_offset": 1,
         "end_offset": 9,
         "type": "word",
         "position": 15
      },
      {
         "token": "aracetamo",
         "start_offset": 1,
         "end_offset": 10,
         "type": "word",
         "position": 16
      },
      {
         "token": "aracetamol",
         "start_offset": 1,
         "end_offset": 11,
         "type": "word",
         "position": 17
      },
      {
         "token": "rac",
         "start_offset": 2,
         "end_offset": 5,
         "type": "word",
         "position": 18
      },
      {
         "token": "race",
         "start_offset": 2,
         "end_offset": 6,
         "type": "word",
         "position": 19
      },
      {
         "token": "racet",
         "start_offset": 2,
         "end_offset": 7,
         "type": "word",
         "position": 20
      },
      {
         "token": "raceta",
         "start_offset": 2,
         "end_offset": 8,
         "type": "word",
         "position": 21
      },
      {
         "token": "racetam",
         "start_offset": 2,
         "end_offset": 9,
         "type": "word",
         "position": 22
      },
      {
         "token": "racetamo",
         "start_offset": 2,
         "end_offset": 10,
         "type": "word",
         "position": 23
      },
      {
         "token": "racetamol",
         "start_offset": 2,
         "end_offset": 11,
         "type": "word",
         "position": 24
      },
      {
         "token": "ace",
         "start_offset": 3,
         "end_offset": 6,
         "type": "word",
         "position": 25
      },
      {
         "token": "acet",
         "start_offset": 3,
         "end_offset": 7,
         "type": "word",
         "position": 26
      },
      {
         "token": "aceta",
         "start_offset": 3,
         "end_offset": 8,
         "type": "word",
         "position": 27
      },
      {
         "token": "acetam",
         "start_offset": 3,
         "end_offset": 9,
         "type": "word",
         "position": 28
      },
      {
         "token": "acetamo",
         "start_offset": 3,
         "end_offset": 10,
         "type": "word",
         "position": 29
      },
      {
         "token": "acetamol",
         "start_offset": 3,
         "end_offset": 11,
         "type": "word",
         "position": 30
      },
      {
         "token": "cet",
         "start_offset": 4,
         "end_offset": 7,
         "type": "word",
         "position": 31
      },
      {
         "token": "ceta",
         "start_offset": 4,
         "end_offset": 8,
         "type": "word",
         "position": 32
      },
      {
         "token": "cetam",
         "start_offset": 4,
         "end_offset": 9,
         "type": "word",
         "position": 33
      },
      {
         "token": "cetamo",
         "start_offset": 4,
         "end_offset": 10,
         "type": "word",
         "position": 34
      },
      {
         "token": "cetamol",
         "start_offset": 4,
         "end_offset": 11,
         "type": "word",
         "position": 35
      },
      {
         "token": "eta",
         "start_offset": 5,
         "end_offset": 8,
         "type": "word",
         "position": 36
      },
      {
         "token": "etam",
         "start_offset": 5,
         "end_offset": 9,
         "type": "word",
         "position": 37
      },
      {
         "token": "etamo",
         "start_offset": 5,
         "end_offset": 10,
         "type": "word",
         "position": 38
      },
      {
         "token": "etamol",
         "start_offset": 5,
         "end_offset": 11,
         "type": "word",
         "position": 39
      },
      {
         "token": "tam",
         "start_offset": 6,
         "end_offset": 9,
         "type": "word",
         "position": 40
      },
      {
         "token": "tamo",
         "start_offset": 6,
         "end_offset": 10,
         "type": "word",
         "position": 41
      },
      {
         "token": "tamol",
         "start_offset": 6,
         "end_offset": 11,
         "type": "word",
         "position": 42
      },
      {
         "token": "amo",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 43
      },
      {
         "token": "amol",
         "start_offset": 7,
         "end_offset": 11,
         "type": "word",
         "position": 44
      },
      {
         "token": "mol",
         "start_offset": 8,
         "end_offset": 11,
         "type": "word",
         "position": 45
      }
   ]
}
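A filter-based setup that could produce the second output might look like the sketch below. Again the names are assumptions; the `keyword` tokenizer keeps "Paracetamol" as a single token, which the ngram token filter then breaks into grams. (On recent Elasticsearch versions, a `max_gram` minus `min_gram` difference this large may also require raising the `index.max_ngram_diff` index setting.)

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 11
        }
      },
      "analyzer": {
        "ngram_filter_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "my_ngram_filter"]
        }
      }
    }
  }
}
```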
Val
Imran Azad
  • http://stackoverflow.com/questions/31398617/how-edge-ngram-token-filter-differs-from-ngram-token-filter – undefined_variable Nov 02 '15 at 11:29
  • @undefined_variable, the question you linked to is completely different than the one asked here. This question is about **nGram filter** vs **nGram tokenizer**; you linked to a question about **edge nGram filter** vs **nGram filter**. – Paul Nov 03 '16 at 17:56
  • @Paul there is a comment in the answer that it stands true even for token filter – undefined_variable Nov 04 '16 at 09:08
  • @undefined_variable, my point was that this question is about filter vs tokenizer, the question you linked to is ngram vs edge ngram. – Paul Nov 04 '16 at 12:48

1 Answer


These two approaches may produce the same set of tokens, but as your outputs show, they differ in the metadata attached to each token: the ngram tokenizer here reports the span of the whole input for every gram (start_offset 0, end_offset 11, position 1), while the ngram filter preserves each gram's actual offsets and assigns increasing positions. Depending on the circumstances, one approach may be better than the other. If you need special characters in your search terms, you will probably need to use the ngram tokenizer in your mapping, since its token_chars setting controls which character classes are kept in the grams. It's useful to know how to use both.
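To illustrate the special-characters point, a hypothetical tokenizer configuration (names and gram sizes are made up for the example) can whitelist punctuation and symbols via `token_chars`, so grams are built only from those character classes and everything else acts as a break:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "code_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}
```

The ngram token filter has no equivalent setting; it only sees whatever tokens the preceding tokenizer already produced.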
Reference

S.M.Mousavi