I have a situation where someone else's application is passing data from database tables to Elasticsearch for indexing. The challenge is that, apart from two fields (schema and table), all of the field names are arbitrary. The data sent for indexing comes from user-nominated columns of tables being loaded into a data warehouse. The names of these "other" fields match the original column names in the source database table - which is why the field names are arbitrary.
The two fields named "schema" and "table" contain the names of the schema and table that the indexed data came from. It is the data, not the schema and table names, that users need to be able to search.
So, my specific problem is that if there is a schema named, for example, "xyzzy" and a user searches the content for "xyzzy", then every record from that schema will be matched, because "schema": "xyzzy" matches.
What I would really like to be able to do is search all of the fields except for "schema" and "table".
This answer is close to what I need: Exclude a field on a Elasticsearch query. But the problem is that if the data is of the form below (where "result" simply annotates the expected outcome and is not an indexed field):
{ "schema" : "xyzzy", "comment" : "say xyzzy to win a prize", result: "match" }
{ "schema" : "xyzzy", "comment" : "say syzygy to win a prize", result: "no match" }
{ "schema" : "other", "comment" : "say xyzzy to win a prize", result: "match" }
then the second solution in the referenced post correctly returns the third record but not the first. The first solution correctly returns the first and third records, but requires me to generate an explicit list of columns to search every time I build the query. This could become a problem as the number of indexed fields grows.
What I'd really like to be able to do is search all fields except "schema" and "table" for the string "xyzzy".
Is that possible? I've looked at changing the mapping definition (i.e. not indexing "table" and "schema"), but that would require changes to the other application (which I have no control over) and it might affect other users of the Elasticsearch service.
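For reference, the mapping change I'm trying to avoid would look something like the sketch below (assuming the index could be recreated; in 5.x, "include_in_all": false keeps a field searchable by name but drops it from the catch-all _all field that unqualified queries search):
curl -XPUT 'localhost:9200/test-data' -d '{
  "mappings": {
    "items": {
      "properties": {
        "schema": { "type": "keyword", "include_in_all": false },
        "table":  { "type": "keyword", "include_in_all": false }
      }
    }
  }
}'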
Here are some examples of what I've done so far (with slightly different field names). First, some sample data:
curl -XPUT 'localhost:9200/test-data/items/9997' -d '{
  "schema": "test",
  "table": "stones",
  "name": "Bart ignores homer"
}'
curl -XPUT 'localhost:9200/test-data/items/9998' -d '{
  "schema": "test",
  "table": "stones",
  "name": "Bart test for homer"
}'
curl -XPUT 'localhost:9200/test-data/items/9999' -d '{
  "schema": "other",
  "table": "stones",
  "name": "Lisa passes test for high school"
}'
A query that works because every searchable field is listed explicitly. This is based upon the first solution in the post referenced above. It will quickly become unworkable as more and more fields are added.
curl -XGET 'localhost:9200/test-data/_search?pretty=true' -d '{
  "query": {
    "query_string": {
      "fields": [ "addr", "eventname",
                  "name", "productdescription",
                  "productname", "productvendor", "suburb", "venuecity",
                  "venuename", "venuestate" ],
      "query": "test"
    }
  }
}'
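One way I could generate that field list automatically is to read it from the mapping at query-build time and filter out "schema" and "table" client-side. A sketch (assuming jq is available and the default dynamic mapping):
curl -s -XGET 'localhost:9200/test-data/_mapping/items' |
  jq '.["test-data"].mappings.items.properties | keys - ["schema","table"]'
That still means an extra round trip and string handling for every search, so a server-side "all fields except these" would be much cleaner.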
A query that does not produce the correct result because it misses the second record, id 9998 ("Bart test for homer"), whose schema name is "test". This is based upon the second answer in the post referenced above.
curl -XGET 'localhost:9200/test-data/_search?pretty=true' -d '{
  "query": {
    "query_string": {
      "query": "test AND -schema:test"
    }
  }
}'
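Rewriting that as a bool query has the same flaw, because must_not excludes whole documents rather than just preventing the "schema" field from matching. A sketch that still returns only id 9999:
curl -XGET 'localhost:9200/test-data/_search?pretty=true' -d '{
  "query": {
    "bool": {
      "must": { "query_string": { "query": "test" } },
      "must_not": { "match": { "schema": "test" } }
    }
  }
}'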
Elasticsearch version information is:
"number" : "5.5.0"
"lucene_version" : "6.6.0"