I will be getting documents from a filtered query (quite a lot of documents). I will then immediately create an index from them (in Python, using requests to directly query the REST API), without any modification.

Is it possible to perform this operation directly on the server, without the round trip of the data to the script and back?

Another question was similar in intent, and its only answer is to go via Logstash (which is equivalent to using my code, though possibly more efficient).

WoJ

2 Answers

1

Refer to http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html

In short, what you need to do is:

0.) Ensure that _source is enabled (it is by default).

1.) Use the scan and scroll API: pass your filtered query with search type scan.

2.) Fetch the documents using the scroll id.

3.) Bulk index the results using the _source field, which returns the JSON that was used to index the data.
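The steps above could be sketched roughly as follows in Python. This is only a sketch under assumptions: the function names (to_bulk_body, copy_by_query) and the base_url, page_size, and session parameters are made up for illustration, and session is expected to behave like a requests.Session. The request bodies follow the form used by more recent Elasticsearch releases (a scroll sorted on _doc rather than the scan search type, which was later removed), so they may need adjusting for your version; older releases also expect a _type in the bulk action line.

```python
import json


def to_bulk_body(hits, dest_index):
    """Turn search hits into an NDJSON payload for the _bulk endpoint."""
    lines = []
    for hit in hits:
        # Action line: index the document under its original id.
        lines.append(json.dumps({"index": {"_index": dest_index, "_id": hit["_id"]}}))
        # Source line: the JSON that was originally indexed.
        lines.append(json.dumps(hit["_source"]))
    # The bulk API expects the payload to end with a newline.
    return "\n".join(lines) + "\n"


def copy_by_query(session, base_url, source_index, dest_index, query, page_size=500):
    """Scroll over documents matching `query` and bulk index them into dest_index.

    `session` is assumed to behave like requests.Session().
    Returns the number of documents sent to the bulk endpoint.
    """
    headers = {"Content-Type": "application/x-ndjson"}
    # Open the scroll; sorting on _doc is the cheapest order for a full scan.
    resp = session.post(
        f"{base_url}/{source_index}/_search",
        params={"scroll": "1m"},
        json={"query": query, "size": page_size, "sort": ["_doc"]},
    )
    resp.raise_for_status()
    page = resp.json()
    copied = 0
    while page["hits"]["hits"]:
        hits = page["hits"]["hits"]
        bulk = session.post(
            f"{base_url}/_bulk",
            data=to_bulk_body(hits, dest_index),
            headers=headers,
        )
        bulk.raise_for_status()
        # A bulk request can return HTTP 200 while individual items failed.
        if bulk.json().get("errors"):
            raise RuntimeError("some documents failed to index")
        copied += len(hits)
        # Fetch the next page using the scroll id from the previous response.
        resp = session.post(
            f"{base_url}/_search/scroll",
            json={"scroll": "1m", "scroll_id": page["_scroll_id"]},
        )
        resp.raise_for_status()
        page = resp.json()
    return copied
```

With the requests library installed, a call might look like copy_by_query(requests.Session(), "http://localhost:9200", "old_index", "new_index", {"match_all": {}}), where the URL and index names are placeholders. Note that the documents still travel to the script and back; this only automates the fetch-and-push loop.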

refer: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html

guide/en/elasticsearch/guide/current/bulk.html

guide/en/elasticsearch/guide/current/reindex.html

Ankur Goel
  • This is in essence what I was planning to do, per my question. How does one ensure the operation is done within Elasticsearch, to avoid the fetch-and-push part? – WoJ Dec 14 '14 at 07:05
  • By "done within Elasticsearch", do you mean you want to avoid using the network? I don't think you can avoid network calls: even if you run your program on the server itself, it will still communicate with ES over HTTP (and even the Java transport client still uses the network). – Ankur Goel Dec 15 '14 at 04:23
0

Elasticsearch 2.3 has an experimental feature, the Reindex API, that allows reindexing from a query: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
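As a rough illustration of what a call to that feature might look like from Python, here is a minimal sketch. The function names (build_reindex_body, reindex_by_query) and the session and base_url parameters are hypothetical, and session is assumed to behave like a requests.Session. The body sent to the _reindex endpoint carries the source index, the filtering query, and the destination index, so the copy happens on the server side.

```python
def build_reindex_body(source_index, dest_index, query):
    """Build the request body for POST /_reindex, restricted to `query`."""
    return {
        "source": {"index": source_index, "query": query},
        "dest": {"index": dest_index},
    }


def reindex_by_query(session, base_url, source_index, dest_index, query):
    """Ask Elasticsearch to copy the matching documents server-side.

    `session` is assumed to behave like requests.Session().
    Returns the parsed JSON response from the _reindex endpoint.
    """
    resp = session.post(
        f"{base_url}/_reindex",
        json=build_reindex_body(source_index, dest_index, query),
    )
    resp.raise_for_status()
    return resp.json()
```

Only the small request and the summary response cross the network here, which is what the question asks for.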

fast tooth