I have "documents" (activerecords) with an attribute called deviations. The attribute has values like "Bin X" "Bin $" "Bin q" "Bin %" etc.

I am trying to use Tire/Elasticsearch to search the attribute. I am using the whitespace analyzer to index the deviation attribute. Here is my code for creating the index:

settings :analysis => {
    :filter  => {
      :ngram_filter => {
        :type => "nGram",
        :min_gram => 2,
        :max_gram => 255
      },
      :deviation_filter => {
        :type => "word_delimiter",
        :type_table => ['$ => ALPHA']
      }
    },
    :analyzer => {
      :ngram_analyzer => {
        :type  => "custom",
        :tokenizer  => "standard",
        :filter  => ["lowercase", "ngram_filter"]
      },
      :deviation_analyzer => {
        :type => "custom",
        :tokenizer => "whitespace",
        :filter => ["lowercase"]
      }
    }
  } do
    mapping do
      indexes :id, :type => 'integer'
      [:equipment, :step, :recipe, :details, :description].each do |attribute|
        indexes attribute, :type => 'string', :analyzer => 'ngram_analyzer'
      end
      indexes :deviation, :analyzer => 'whitespace'
    end
  end

The search works fine when the query string contains no special characters. For example, Bin X returns only those records that contain both the words Bin AND X. However, searching for something like Bin $ or Bin % returns all results that contain the word Bin, almost ignoring the symbol (results with the symbol do rank higher than results without it).
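
To illustrate what I suspect is happening, here is a rough pure-Ruby approximation (not Elasticsearch's actual analysis chain; the helper names are made up for this sketch) of how the two tokenizers treat such a query. A whitespace tokenizer keeps $ as its own token, while the standard tokenizer drops bare punctuation entirely, so any analyzer built on it (like the ngram_analyzer above) never sees the symbol:

```ruby
# Rough approximations of the two tokenizers, for illustration only.

# Whitespace tokenizer + lowercase filter: split on whitespace, keep symbols.
def whitespace_analyze(text)
  text.split(/\s+/).map(&:downcase)
end

# Standard-tokenizer-like behavior: keep only alphanumeric runs, so
# standalone punctuation such as "$" or "%" produces no token at all.
def standard_analyze(text)
  text.scan(/[[:alnum:]]+/).map(&:downcase)
end

whitespace_analyze("Bin $")  # => ["bin", "$"]
standard_analyze("Bin $")    # => ["bin"]
```

If the query text is analyzed with the standard-based analyzer anywhere along the way, the $ term vanishes before matching, which would explain why Bin $ behaves like a search for Bin alone.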

Here is the search method I have created:

def self.search(params)
  tire.search(load: true) do
    query { string "#{params[:term].downcase}:#{params[:query]}", default_operator: "AND" }
    size 1000
  end
end

and here is how I am building the search form:

<div>
    <%= form_tag issues_path, :class=> "formtastic issue", method: :get do %>
        <fieldset class="inputs">
        <ol>
            <li class="string input medium search query optional stringish inline">
                <% opts = ["Description", "Detail","Deviation","Equipment","Recipe", "Step"] %>
                <%= select_tag :term, options_for_select(opts, params[:term]) %>
                <%= text_field_tag :query, params[:query] %>
                <%= submit_tag "Search", name: nil, class: "btn" %>
            </li>
        </ol>
        </fieldset>
    <% end %>
</div>
Stephen Ostermiller
Arnob
  • Can't you just escape the characters that have meaning to Lucene with a backslash? Of course, in a Ruby string you'd need a double backslash \\ to escape the Ruby character before it hits the Elasticsearch API. I've not tried Tire, so I don't know if it works in your world. FYI, here is a quick reference to the characters affected: http://docs.lucidworks.com/display/lweug/Escaping+Special+Syntax+Characters – Phil Apr 26 '13 at 13:39
  • I don't think this is the issue, because the queries Bin $ and Bin % are affected, yet $ and % are not listed in the link above as special characters. – Arnob Apr 26 '13 at 17:48
  • I know from my own experience of full-text search in databases (Oracle I think it was, and MySQL for LIKE tests in varchar or text fields) that % is a 'match everything' character. Maybe that link above is incomplete, or maybe it's not relevant to your issue. Have you tried escaping to see if that solves the problem? – Phil Apr 27 '13 at 18:34
  • Escaping the special characters with \ (for example Bin \%) or \\ (for example Bin \\%) has no effect on the behavior. – Arnob Apr 30 '13 at 20:13

1 Answer

You can sanitize your query string. Here is a sanitizer that works for everything that I've tried throwing at it:

def sanitize_string_for_elasticsearch_string_query(str)
  # Escape special characters
  # http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html#Escaping%20Special%20Characters
  escaped_characters = Regexp.escape('\\/+-&|!(){}[]^~*?:')
  str = str.gsub(/([#{escaped_characters}])/, '\\\\\1')

  # AND, OR and NOT are used by lucene as logical operators. We need
  # to escape them
  ['AND', 'OR', 'NOT'].each do |word|
    escaped_word = word.split('').map {|char| "\\#{char}" }.join('')
    str = str.gsub(/\s*\b(#{word.upcase})\b\s*/, " #{escaped_word} ")
  end

  # Escape odd quotes
  quote_count = str.count '"'
  str = str.gsub(/(.*)"(.*)/, '\1\"\2') if quote_count % 2 == 1

  str
end

params[:query] = sanitize_string_for_elasticsearch_string_query(params[:query])
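
To see what the sanitizer actually produces, here it is exercised on a few representative inputs (the method is repeated so the snippet runs standalone; the quote branch uses the two-group backreference \2):

```ruby
def sanitize_string_for_elasticsearch_string_query(str)
  # Backslash-escape Lucene query-syntax characters.
  escaped_characters = Regexp.escape('\\/+-&|!(){}[]^~*?:')
  str = str.gsub(/([#{escaped_characters}])/, '\\\\\1')

  # Neutralize the AND/OR/NOT operators by escaping each letter.
  ['AND', 'OR', 'NOT'].each do |word|
    escaped_word = word.split('').map { |char| "\\#{char}" }.join('')
    str = str.gsub(/\s*\b(#{word.upcase})\b\s*/, " #{escaped_word} ")
  end

  # Escape the quote when the count is odd (note the two-group \2).
  str = str.gsub(/(.*)"(.*)/, '\1\"\2') if str.count('"').odd?
  str
end

sanitize_string_for_elasticsearch_string_query('foo + bar')     # => 'foo \+ bar'
sanitize_string_for_elasticsearch_string_query('this AND that') # => 'this \A\N\D that'
sanitize_string_for_elasticsearch_string_query('say "cheese')   # => 'say \"cheese'
```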
Robert Kajic
  • I needed to add forward slash also to the `escaped_characters` array. `escaped_characters = Regexp.escape('\\+-&|!(){}[]^~*?:\/')` as it was breaking for strings with forward slash. – rubyprince Jun 27 '13 at 12:19
  • That's strange since `/` is not a special character in Lucene: http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html#Escaping%20Special%20Characters – Robert Kajic Jun 27 '13 at 13:19
  • Hi, please see http://50.16.250.253:9200/locations/location/_search?q=123%2F345 ... I think this is giving an error because `/` is inside the string... when I escape it with a `\\`, the error is resolved: http://50.16.250.253:9200/locations/location/_search?q=123%5C%2F345 – rubyprince Jul 01 '13 at 11:58
  • Hi, the quote-escape regexp should be `str = str.gsub(/(.*)"(.*)/, '\1\"\2') if quote_count % 2 == 1`, because there are just two groups – kalifs Aug 01 '13 at 10:27
  • Just want to note: forward slash is now a special character and should be escaped. http://lucene.apache.org/core/4_6_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Escaping_Special_Characters – Dmitry Jan 16 '14 at 06:03
  • I've translated this solution to Scala here: http://stackoverflow.com/questions/32107601/is-there-an-implementation-of-a-search-term-sanitizer-for-elasticsearch-in-scala/32107602 – Zoltán Aug 20 '15 at 19:54
  • I've translated the solution to Python here: https://gist.github.com/eranhirs/5c9ef5de8b8731948e6ed14486058842 – Eran H. Dec 24 '16 at 22:00