
With mapping types being removed in Elasticsearch 6.0 I wonder if IDs of documents are guaranteed to be unique across indices?

Say I have three indices, all with a "parent" field that contains an ID. Do I need to include which index the ID belongs to or can I just search through all three indices when looking for a document with the given ID?

Oskar Persson

2 Answers


IDs are not unique across indices. If you want to refer to a document you need to know both the index name and the ID.
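
To illustrate the point, here is a minimal in-memory sketch (not a real Elasticsearch client). The `Cluster` type, the helper functions, and the index names `books`, `authors` and `publishers` are all hypothetical; the only assumption carried over from the answer is that a document is addressed by the pair (index name, ID). A "parent" reference therefore stores both parts:

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for a cluster: documents are keyed by
# (index name, document ID), mirroring the fact that an ID is only unique
# within its own index.
Cluster = Dict[Tuple[str, str], dict]


def put(cluster: Cluster, index: str, doc_id: str, source: dict) -> None:
    """Store a document under the composite key (index, doc_id)."""
    cluster[(index, doc_id)] = source


def find_by_id(cluster: Cluster, doc_id: str) -> List[Tuple[str, dict]]:
    """Look up an ID across all indices; this may return several hits."""
    return [(index, src) for (index, i), src in cluster.items() if i == doc_id]


def get(cluster: Cluster, index: str, doc_id: str) -> dict:
    """Look up exactly one document by index name and ID."""
    return cluster[(index, doc_id)]


cluster: Cluster = {}

# The same explicit ID goes into three different indices without conflict.
for name in ("books", "authors", "publishers"):
    put(cluster, name, "1", {"kind": name})

# A child document that records both the parent's index and the parent's ID.
put(cluster, "books", "2", {"parent": {"index": "authors", "id": "1"}})

# Searching by ID alone is ambiguous: one hit per index.
hits = find_by_id(cluster, "1")

# Resolving the reference with index and ID yields exactly one document.
parent_ref = get(cluster, "books", "2")["parent"]
parent = get(cluster, parent_ref["index"], parent_ref["id"])
```

Searching all three indices for ID `1` gives three candidates, while the (index, ID) pair resolves to a single parent.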

Tim
  • Just to be sure before accepting as the answer, do you have a source for this? – Oskar Persson Feb 08 '18 at 08:40
  • Well, I'm an engineer on the Elasticsearch team, so can I just quote myself as a source? But in any case, it's a direct implication of the fact that you can set the ID explicitly. Just test it out: create 3 indices, and in each one, put a document with an ID of 1. – Tim Mar 08 '18 at 11:01
  • @Tim is this still valid when using only auto-generated IDs? I mean, if we let ES generate every doc IDs should we expect to have duplicates? – sox supports the mods Jul 26 '19 at 10:11
  • Isn't this answer contradicted by the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-id-field.html? In case the link gets broken, a quote from the doc: "Each document has an _id that uniquely identifies it". Nothing about potential clashes across indices there. Of course, we are in the next version now. – Maxim.K Mar 18 '21 at 10:04
  • You are reading more into the documentation than is intended. The `_id` only uniquely identifies a document within an index; it is not a unique identifier across a whole cluster (or globally). – Tim Mar 21 '21 at 03:51
  • This is really a bad architectural decision by Elastic because if you are black boxing the indices you should have a solution for uniqueness of ids. This is not even documented well in the data-stream page. – Kaveh Mar 13 '23 at 15:22

Explicit IDs

If you explicitly set the document ID when indexing, nothing prevents you from using the same ID twice for documents going in different indices.

Autogenerated IDs

If you don't set the ID when indexing, ES will generate one before storing the document.
According to the code, the ID is securely generated from a random number, the host MAC address and the current timestamp in ms. Additional work is done to ensure that the timestamp (and thus the ID sequence) increases monotonically.

For the same ID to be generated twice, the same random number would have to be picked when the JVM starts, and the document ID would have to be generated at exactly the same moment, with sub-millisecond precision. So while the chance exists, it's so small that I wouldn't worry about it (just as I wouldn't worry about collisions when using a hash function to check file integrity).
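
To make the ingredients above concrete, here is a simplified sketch of a time-based ID generator. It is not the Elasticsearch implementation: the class name `TimeBasedIdGenerator`, the 6/6/3-byte layout and the use of `uuid.getnode()` as the hardware address are assumptions made for illustration. It only shows how a randomly seeded sequence number, a MAC address and a non-decreasing millisecond timestamp can be combined into one ID.

```python
import base64
import secrets
import threading
import time
import uuid


class TimeBasedIdGenerator:
    """Simplified illustration only; not Elasticsearch's actual code."""

    def __init__(self) -> None:
        # Random starting point for the sequence, picked once per process.
        self._sequence = secrets.randbits(24)
        self._last_timestamp = 0
        # uuid.getnode() returns a hardware address as a 48-bit integer
        # (or a random 48-bit number if no address can be determined).
        self._node = uuid.getnode().to_bytes(6, "big")
        self._lock = threading.Lock()

    def next_id(self) -> str:
        with self._lock:
            self._sequence = (self._sequence + 1) & 0xFFFFFF
            now_ms = time.time_ns() // 1_000_000
            # Never let the timestamp go backwards, even if the clock does.
            timestamp = max(self._last_timestamp, now_ms)
            # If the sequence wrapped around, advance the timestamp so the
            # (timestamp, sequence) pair is not reused by this process.
            if self._sequence == 0:
                timestamp += 1
            self._last_timestamp = timestamp
            sequence = self._sequence
        raw = (
            timestamp.to_bytes(6, "big")
            + self._node
            + sequence.to_bytes(3, "big")
        )
        # 15 bytes encode to 20 URL-safe characters with no padding.
        return base64.urlsafe_b64encode(raw).decode("ascii")


generator = TimeBasedIdGenerator()
ids = [generator.next_id() for _ in range(10_000)]
```

In this sketch, two processes on the same host could only produce the same ID if they generated it in the same millisecond with the same sequence value, which is the kind of coincidence described above.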

Final note: as a code comment notes, the implementation is opaque and could change at any time, so what I wrote might not hold true in future versions.