
Can someone explain why using ngram as a tokenizer gives a different output compared to using it as a filter? For example, using it as a tokenizer for "Paracetamol" I get:

{
   "tokens": [
      {
         "token": "par",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "para",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "parac",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "parace",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paracet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paraceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paracetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paracetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "paracetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "ara",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "arac",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "arace",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aracet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "araceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aracetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aracetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aracetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "rac",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "race",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "racet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "raceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "racetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "racetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "racetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "ace",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "acet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "aceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "acetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "acetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "acetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "cet",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "ceta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "cetam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "cetamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "cetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "eta",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "etam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "etamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "etamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "tam",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "tamo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "tamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "amo",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "amol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      },
      {
         "token": "mol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 1
      }
   ]
}
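For reference, a tokenizer-based setup that could produce output like the above might look as follows. This is a sketch, not the asker's actual mapping: the index and analyzer names are assumptions, and `min_gram: 3` / `max_gram: 11` are inferred from the shortest and longest grams in the token list.

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 11
        }
      },
      "analyzer": {
        "ngram_tokenizer_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Here the ngram tokenizer itself slices the raw input into grams, so there is only one analysis "unit" (the whole input) from which all grams are emitted.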

Whereas using it as a filter I get:

{
   "tokens": [
      {
         "token": "par",
         "start_offset": 0,
         "end_offset": 3,
         "type": "word",
         "position": 1
      },
      {
         "token": "para",
         "start_offset": 0,
         "end_offset": 4,
         "type": "word",
         "position": 2
      },
      {
         "token": "parac",
         "start_offset": 0,
         "end_offset": 5,
         "type": "word",
         "position": 3
      },
      {
         "token": "parace",
         "start_offset": 0,
         "end_offset": 6,
         "type": "word",
         "position": 4
      },
      {
         "token": "paracet",
         "start_offset": 0,
         "end_offset": 7,
         "type": "word",
         "position": 5
      },
      {
         "token": "paraceta",
         "start_offset": 0,
         "end_offset": 8,
         "type": "word",
         "position": 6
      },
      {
         "token": "paracetam",
         "start_offset": 0,
         "end_offset": 9,
         "type": "word",
         "position": 7
      },
      {
         "token": "paracetamo",
         "start_offset": 0,
         "end_offset": 10,
         "type": "word",
         "position": 8
      },
      {
         "token": "paracetamol",
         "start_offset": 0,
         "end_offset": 11,
         "type": "word",
         "position": 9
      },
      {
         "token": "ara",
         "start_offset": 1,
         "end_offset": 4,
         "type": "word",
         "position": 10
      },
      {
         "token": "arac",
         "start_offset": 1,
         "end_offset": 5,
         "type": "word",
         "position": 11
      },
      {
         "token": "arace",
         "start_offset": 1,
         "end_offset": 6,
         "type": "word",
         "position": 12
      },
      {
         "token": "aracet",
         "start_offset": 1,
         "end_offset": 7,
         "type": "word",
         "position": 13
      },
      {
         "token": "araceta",
         "start_offset": 1,
         "end_offset": 8,
         "type": "word",
         "position": 14
      },
      {
         "token": "aracetam",
         "start_offset": 1,
         "end_offset": 9,
         "type": "word",
         "position": 15
      },
      {
         "token": "aracetamo",
         "start_offset": 1,
         "end_offset": 10,
         "type": "word",
         "position": 16
      },
      {
         "token": "aracetamol",
         "start_offset": 1,
         "end_offset": 11,
         "type": "word",
         "position": 17
      },
      {
         "token": "rac",
         "start_offset": 2,
         "end_offset": 5,
         "type": "word",
         "position": 18
      },
      {
         "token": "race",
         "start_offset": 2,
         "end_offset": 6,
         "type": "word",
         "position": 19
      },
      {
         "token": "racet",
         "start_offset": 2,
         "end_offset": 7,
         "type": "word",
         "position": 20
      },
      {
         "token": "raceta",
         "start_offset": 2,
         "end_offset": 8,
         "type": "word",
         "position": 21
      },
      {
         "token": "racetam",
         "start_offset": 2,
         "end_offset": 9,
         "type": "word",
         "position": 22
      },
      {
         "token": "racetamo",
         "start_offset": 2,
         "end_offset": 10,
         "type": "word",
         "position": 23
      },
      {
         "token": "racetamol",
         "start_offset": 2,
         "end_offset": 11,
         "type": "word",
         "position": 24
      },
      {
         "token": "ace",
         "start_offset": 3,
         "end_offset": 6,
         "type": "word",
         "position": 25
      },
      {
         "token": "acet",
         "start_offset": 3,
         "end_offset": 7,
         "type": "word",
         "position": 26
      },
      {
         "token": "aceta",
         "start_offset": 3,
         "end_offset": 8,
         "type": "word",
         "position": 27
      },
      {
         "token": "acetam",
         "start_offset": 3,
         "end_offset": 9,
         "type": "word",
         "position": 28
      },
      {
         "token": "acetamo",
         "start_offset": 3,
         "end_offset": 10,
         "type": "word",
         "position": 29
      },
      {
         "token": "acetamol",
         "start_offset": 3,
         "end_offset": 11,
         "type": "word",
         "position": 30
      },
      {
         "token": "cet",
         "start_offset": 4,
         "end_offset": 7,
         "type": "word",
         "position": 31
      },
      {
         "token": "ceta",
         "start_offset": 4,
         "end_offset": 8,
         "type": "word",
         "position": 32
      },
      {
         "token": "cetam",
         "start_offset": 4,
         "end_offset": 9,
         "type": "word",
         "position": 33
      },
      {
         "token": "cetamo",
         "start_offset": 4,
         "end_offset": 10,
         "type": "word",
         "position": 34
      },
      {
         "token": "cetamol",
         "start_offset": 4,
         "end_offset": 11,
         "type": "word",
         "position": 35
      },
      {
         "token": "eta",
         "start_offset": 5,
         "end_offset": 8,
         "type": "word",
         "position": 36
      },
      {
         "token": "etam",
         "start_offset": 5,
         "end_offset": 9,
         "type": "word",
         "position": 37
      },
      {
         "token": "etamo",
         "start_offset": 5,
         "end_offset": 10,
         "type": "word",
         "position": 38
      },
      {
         "token": "etamol",
         "start_offset": 5,
         "end_offset": 11,
         "type": "word",
         "position": 39
      },
      {
         "token": "tam",
         "start_offset": 6,
         "end_offset": 9,
         "type": "word",
         "position": 40
      },
      {
         "token": "tamo",
         "start_offset": 6,
         "end_offset": 10,
         "type": "word",
         "position": 41
      },
      {
         "token": "tamol",
         "start_offset": 6,
         "end_offset": 11,
         "type": "word",
         "position": 42
      },
      {
         "token": "amo",
         "start_offset": 7,
         "end_offset": 10,
         "type": "word",
         "position": 43
      },
      {
         "token": "amol",
         "start_offset": 7,
         "end_offset": 11,
         "type": "word",
         "position": 44
      },
      {
         "token": "mol",
         "start_offset": 8,
         "end_offset": 11,
         "type": "word",
         "position": 45
      }
   ]
}
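A filter-based setup that could produce the second output might look like the sketch below. Again the names are assumptions; the `keyword` tokenizer keeps "Paracetamol" as a single token, which the ngram token filter then breaks into grams. (On recent Elasticsearch versions, a `max_gram` minus `min_gram` difference this large may also require raising the `index.max_ngram_diff` index setting.)

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 11
        }
      },
      "analyzer": {
        "ngram_filter_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase", "my_ngram_filter"]
        }
      }
    }
  }
}
```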
Val
Imran Azad
  • http://stackoverflow.com/questions/31398617/how-edge-ngram-token-filter-differs-from-ngram-token-filter – undefined_variable Nov 02 '15 at 11:29
  • @undefined_variable, the question you linked to is completely different than the one asked here. This question is about **nGram filter** vs **nGram tokenizer**; you linked to a question about **edge nGram filter** vs **nGram filter**. – Paul Nov 03 '16 at 17:56
  • @Paul there is a comment in the answer that it stands true even for token filter – undefined_variable Nov 04 '16 at 09:08
  • @undefined_variable, my point was that this question is about filter vs tokenizer, the question you linked to is ngram vs edge ngram. – Paul Nov 04 '16 at 12:48

1 Answer


These two approaches may produce the same set of tokens, but as your outputs show, they differ in the metadata attached to each token: the ngram tokenizer here reports the span of the whole input for every gram (start_offset 0, end_offset 11, position 1), while the ngram filter preserves each gram's actual offsets and assigns increasing positions. Depending on the circumstances, one approach may be better than the other. If you need special characters in your search terms, you will probably need to use the ngram tokenizer in your mapping, since its token_chars setting controls which character classes are kept in the grams. It's useful to know how to use both.
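To illustrate the special-characters point, a hypothetical tokenizer configuration (names and gram sizes are made up for the example) can whitelist punctuation and symbols via `token_chars`, so grams are built only from those character classes and everything else acts as a break:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "code_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 4,
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      }
    }
  }
}
```

The ngram token filter has no equivalent setting; it only sees whatever tokens the preceding tokenizer already produced.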
Reference

S.M.Mousavi