18

I just found that Solr 5 doesn't require a predefined schema file and instead generates the schema based on the documents being indexed. How does this work in the background?

Is it good practice or not? And is there any way to disable it?

Jakub Kotowski
Krunal

2 Answers

33

The schemaless feature has been in Solr since version 4.3, but it is probably more stable now, as a concurrency issue with it was fixed in 4.10.

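It is often also called managed schema, although strictly speaking the managed schema is the mechanism that schemaless mode builds on. When you configure Solr to use a managed schema, Solr uses a special UpdateRequestProcessor to intercept document indexing requests and guess the field types.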
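To make the idea of "guessing field types" concrete, here is a minimal Python sketch of that kind of heuristic. It is a toy model, not Solr's actual implementation: the function names, the type labels and the single date format are assumptions chosen for illustration.

```python
from datetime import datetime


def guess_field_type(value):
    """Return a coarse type label for a raw field value (toy heuristic)."""
    # bool must be checked before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        text = value.strip()
        if text.lower() in ("true", "false"):
            return "boolean"
        # Try progressively looser parsers; the first one that succeeds wins.
        for label, parser in (("long", int), ("double", float)):
            try:
                parser(text)
                return label
            except ValueError:
                pass
        try:
            # Assumed date format for this sketch only.
            datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
            return "date"
        except ValueError:
            pass
    # Anything that did not match a more specific type.
    return "string"


def add_unknown_fields(schema, doc):
    """Add every field of doc that the schema does not know yet."""
    for name, value in doc.items():
        if name not in schema:
            schema[name] = guess_field_type(value)
    return schema
```

Calling add_unknown_fields({"id": "string"}, doc) leaves the existing id field untouched and adds one guessed type per new field, which mirrors the "intercept the document, then extend the schema" flow described above.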

Solr starts with your schema.xml file and creates a new file called, by default, managed-schema to store all the inferred schema information. This file is automatically overwritten by Solr as it detects changes to the schema.

Because Solr rewrites that file itself, you should use the Schema API rather than hand-editing it if you want to make changes to the schema. See also the Schemaless Mode documentation.
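As a rough illustration of a Schema API call, the sketch below builds an "add-field" request with Python's standard library. The base URL, the core name "mycore", the field name "title" and the helper names are assumptions for the example; check the Schema API documentation for your Solr version before relying on it.

```python
import json
import urllib.request


def build_add_field_request(base_url, core, name, field_type, stored=True):
    """Build (but do not send) a Schema API request that adds one field."""
    payload = {"add-field": {"name": name, "type": field_type, "stored": stored}}
    return urllib.request.Request(
        url=f"{base_url}/{core}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send a prepared request to a running Solr and return the response body."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With Solr running locally, send(build_add_field_request("http://localhost:8983/solr", "mycore", "title", "string")) would submit the change. Building and sending are kept separate so the payload can be inspected first.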

How to change Solr managed schema to classic schema

Stop Solr: $ bin/solr stop

Go to server/solr/mycore/conf, where "mycore" is the name of your core/collection.

Edit solrconfig.xml:

  • search for <schemaFactory class="ManagedIndexSchemaFactory"> and comment out the whole element
  • search for <schemaFactory class="ClassicIndexSchemaFactory"/> and uncomment it
  • search for the <initParams> element that refers to add-unknown-fields-to-the-schema and comment out the whole <initParams>...</initParams>

Rename managed-schema to schema.xml and you are done.

You can now start Solr again with $ bin/solr start. Then go to http://localhost:8983/solr/#/mycore/documents and check that Solr now refuses to index a document containing a field that is not yet defined in schema.xml.
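The same check can be scripted instead of using the admin UI. The following Python sketch posts a document with an undeclared field to the update handler and reports whether Solr accepted it. The core name "mycore", the field name "field_not_in_schema" and the helper names are hypothetical, and the exact error status Solr returns is not assumed here.

```python
import json
import urllib.error
import urllib.request


def build_update_request(base_url, core, docs):
    """Build (but do not send) a JSON update request for a list of documents."""
    return urllib.request.Request(
        url=f"{base_url}/{core}/update?commit=true",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def try_index(request):
    """Send the request and return (accepted, http_status)."""
    try:
        with urllib.request.urlopen(request) as response:
            return True, response.status
    except urllib.error.HTTPError as err:
        # With the classic schema, an unknown field should end up here.
        return False, err.code


# Hypothetical test document: "field_not_in_schema" is deliberately undeclared.
probe = [{"id": "probe-1", "field_not_in_schema": "x"}]
```

Against a running instance, try_index(build_update_request("http://localhost:8983/solr", "mycore", probe)) should report the document as not accepted once the classic schema is active, and as accepted while schemaless mode is still on.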

Is it a good practice? When to use it?

It depends on what you want. If you want to enforce a specific document structure (e.g. to make sure that all docs are "well-formed" according to your definition), then you want to use the classic schema.

If, on the other hand, you don't know upfront what the doc structure is, then you might want to use the schemaless feature.

Limits

While it is called schema-less, there are limits to the kinds of structures that you can index. This is true both for Solr and Elasticsearch, by the way. For example, if you first index this doc:

{"name":"John Doe"}

then you will get an error if you try to index a doc like this next:

{"name": {
   "first": "Daniel",
   "second": "Dennett"
   }
}

That is because in the first case the field "name" was of type string, while in the second case it is an object.
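The conflict can be modelled in a few lines of Python. This is only a toy model of the behaviour described above, not how Solr or Elasticsearch implement it: the ToyIndex class and the shape labels are invented for the illustration.

```python
def shape_of(value):
    """Classify a JSON value by its structural shape."""
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        return "array"
    return "scalar"


class ToyIndex:
    """Remembers the shape each field had when it was first indexed."""

    def __init__(self):
        self.field_shapes = {}
        self.docs = []

    def index(self, doc):
        # Validate the whole document before recording anything.
        for name, value in doc.items():
            seen = self.field_shapes.get(name)
            shape = shape_of(value)
            if seen is not None and seen != shape:
                raise ValueError(
                    f"field {name!r} was first indexed as {seen}, got {shape}"
                )
        for name, value in doc.items():
            self.field_shapes.setdefault(name, shape_of(value))
        self.docs.append(doc)
```

Indexing {"name": "John Doe"} first and then {"name": {"first": "Daniel", "second": "Dennett"}} raises a ValueError in this model, because the first document fixed "name" as a scalar field.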

If you would like to use indexing which goes beyond these limitations then you could use SIREn - it is an open source semi-structured information retrieval engine which is implemented as a plugin for both Solr and Elasticsearch. (Disclaimer: I worked for the company that develops SIREn)

Jakub Kotowski
  • Thanks for answer. Can you help me understand how to disable schemaless mode? Any example would be great! – Krunal Apr 23 '15 at 12:33
  • @Krunal updated the answer with steps to change managed schema back to classic schema – Jakub Kotowski Apr 23 '15 at 13:50
  • Thanks for making an update! I will shortly test it and update the results here. – Krunal Apr 24 '15 at 14:09
  • I continue to recommend to my clients that they use the unmanaged schema. The reason is that I tend to version control Solr home directories. Files such as schema.xml and solrconfig.xml should, IMO, be treated as development artifacts. – Doug T. Aug 02 '16 at 14:26
  • It is the best answer I have read on this subject in Solr – saba safavi Mar 08 '21 at 05:59
1

This is the so-called schemaless mode in Solr. I don't know the internal details of how it's implemented.

bin/solr start -e schemaless

The snippet above will start Solr in schemaless mode. If you don't do that, it will work as usual.

For more information on schemaless, take a look here - https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode

Mysterion
  • Hi, can you help me understand how to disable schemaless mode? I think it's the default one, which started automatically. – Krunal Apr 23 '15 at 12:36
  • just run it as usual, or use classic schema factory – Mysterion Apr 23 '15 at 12:51
  • I'm running it as usual, but still it runs schemaless mode. Any idea how to use classic schema factory? Note that I'm running this on Windows – Krunal Apr 23 '15 at 13:06