
We've been using Elasticsearch for a couple of years to deliver the 700,000 or so pieces of content on our site to readers, but some circumstances have changed and we need to work out whether or not the service can adapt with us... (sorry this post is so long, I tried to anticipate all questions!)

We use Elasticsearch to store "snapshots" of our content to avoid duplicating work and slowing down our apps by making them fetch data and resolve all resources from our content APIs. We also take advantage of Elasticsearch's search API to retrieve the content in all sorts of ways.

To maintain content in our cluster we run a service that receives notifications of content changes from our APIs, each of which triggers a content "ingest" (fetching the data, doing any necessary transformation, and indexing it). The same service also periodically "reingests" content over time. Typically a new piece of content will be ingested within 30 seconds of publishing and touched every 5 days or so thereafter.

The most common method our applications use to retrieve content is by "tag". We have list pages to view content by tag and our users can subscribe to content updates for a tag. Every piece of content has one or more tags.

Tags have several properties: ID, name, taxonomy, and the nature of their relationship to the content. They're indexed as nested objects so that we can aggregate on them etc.
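To give a concrete picture, our mapping looks roughly like this (simplified, and the index/type/field names here are illustrative rather than our exact schema):

```json
PUT /content
{
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "text" },
        "tags": {
          "type": "nested",
          "properties": {
            "id":           { "type": "keyword" },
            "name":         { "type": "keyword" },
            "taxonomy":     { "type": "keyword" },
            "relationship": { "type": "keyword" }
          }
        }
      }
    }
  }
}
```

So every document carries its own full copy of each tag, which is why a tag change means touching every document that uses it.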

This is where it gets interesting... tags used to be immutable, but we have recently changed metadata systems and they may now change: names will be updated, IDs may change as tags move between taxonomies, and so on.

We have around 65,000 tags in use, the vast majority of which are used only in relatively small numbers. If and when these tags change we can trigger a reingest of all the associated content without requiring any changes to our infrastructure.

However, we also have some tags which are very common, the most popular of which is used more than 180,000 times. And we've just received warning that it, along with a few others used by tens of thousands of documents, is due to change! So we need to be able to cope with these updates now and into the future.

Triggering a reingest of all the associated content and queuing it up is not the problem, but the reingest itself could take quite some time, at least 3-5 hours in some cases, and we would like to avoid our list pages becoming orphaned or duplicated while it runs.

If you've got this far, thank you! I have two questions:

  1. Is there a more suitable mapping we could use for our documents, knowing now that nested objects, often duplicated thousands of times, may change? Could a parent/child mapping work with so many relations?
  2. Is there an efficient way to update a large number of nested objects? Hacks are fine, at least to cover us in the short term. Could the update by query API and a script handle it?
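For question 2, the kind of thing we're imagining is an update by query driven by a Painless script that rewrites the matching nested tag in place. An untested sketch (tag IDs, names and the index name are placeholders; on 5.x the script key is "inline" rather than "source"):

```json
POST /content/_update_by_query?conflicts=proceed
{
  "query": {
    "nested": {
      "path": "tags",
      "query": { "term": { "tags.id": "old-tag-id" } }
    }
  },
  "script": {
    "lang": "painless",
    "source": "for (def tag : ctx._source.tags) { if (tag.id == params.oldId) { tag.id = params.newId; tag.name = params.newName; } }",
    "params": {
      "oldId": "old-tag-id",
      "newId": "new-tag-id",
      "newName": "New tag name"
    }
  }
}
```

Would this be safe and reasonably fast across ~180,000 documents, or are we better off sticking with a full reingest?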

Thanks

i_like_robots
  • "indexed as nested objects" of your pieces of content? Why do you not use Parent-Child-relation? – Karsten R. Oct 26 '17 at 13:30
  • Why not indeed! I assume this would allow one-to-many relationships but would this have bad side effects as content is distributed between several shards? – i_like_robots Oct 26 '17 at 13:32
  • Child docs are by definition on the same shard: you have to specify the parent when you create a child. So e.g. a piece of content is the parent and each tag on that piece of content is a child. Is this the same as your "nested objects"? – Karsten R. Oct 27 '17 at 05:36

1 Answer


I've already answered a similar question covering your use case of the Nested datatype.

Here is the link to the answer about maintaining parent-child relation data in ES using the Nested datatype.

Try this. Do let me know if this solution helps solve your problem.

Hatim Stovewala
  • Hi Hatim, thank you for your answer - reading the docs about this, it mentions "the parent document and all of its children must live on the same shard". With some tags so wide-reaching, do you think using a parent/child relation is a safe option? – i_like_robots Oct 26 '17 at 14:18