65

I am trying to bulk index a JSON file into a new Elasticsearch index and am unable to do so. I have the following sample data inside the JSON file:

[{"Amount": "480", "Quantity": "2", "Id": "975463711", "Client_Store_sk": "1109"},
{"Amount": "2105", "Quantity": "2", "Id": "975463943", "Client_Store_sk": "1109"},
{"Amount": "2107", "Quantity": "3", "Id": "974920111", "Client_Store_sk": "1109"},
{"Amount": "2115", "Quantity": "2", "Id": "975463798", "Client_Store_sk": "1109"},
{"Amount": "2116", "Quantity": "1", "Id": "975463827", "Client_Store_sk": "1109"},
{"Amount": "648", "Quantity": "3", "Id": "975464139", "Client_Store_sk": "1109"},
{"Amount": "2126", "Quantity": "2", "Id": "975464805", "Client_Store_sk": "1109"},
{"Amount": "2133", "Quantity": "1", "Id": "975464061", "Client_Store_sk": "1109"},
{"Amount": "1339", "Quantity": "4", "Id": "974919458", "Client_Store_sk": "1109"},
{"Amount": "1196", "Quantity": "5", "Id": "974920538", "Client_Store_sk": "1109"},
{"Amount": "1198", "Quantity": "4", "Id": "975463638", "Client_Store_sk": "1109"},
{"Amount": "1345", "Quantity": "4", "Id": "974919522", "Client_Store_sk": "1109"},
{"Amount": "1347", "Quantity": "2", "Id": "974919563", "Client_Store_sk": "1109"},
{"Amount": "673", "Quantity": "2", "Id": "975464359", "Client_Store_sk": "1109"},
{"Amount": "2153", "Quantity": "1", "Id": "975464511", "Client_Store_sk": "1109"},
{"Amount": "3896", "Quantity": "4", "Id": "977289342", "Client_Store_sk": "1109"},
{"Amount": "3897", "Quantity": "4", "Id": "974920602", "Client_Store_sk": "1109"}]

I am using

 curl -XPOST localhost:9200/index_local/my_doc_type/_bulk --data-binary --data @/home/data1.json 

When I try to use the standard bulk index API from Elasticsearch I get this error

 error: {"message":"ActionRequestValidationException[Validation Failed: 1: no requests added;]"}

Can anyone help with indexing this type of JSON?

B--rian
Amit P

4 Answers

108

What you need to do is to read that JSON file and then build a bulk request with the format expected by the _bulk endpoint, i.e. one line for the command and one line for the document, separated by a newline character... rinse and repeat for each document:

curl -XPOST localhost:9200/your_index/_bulk -d '
{"index": {"_index": "your_index", "_type": "your_type", "_id": "975463711"}}
{"Amount": "480", "Quantity": "2", "Id": "975463711", "Client_Store_sk": "1109"}
{"index": {"_index": "your_index", "_type": "your_type", "_id": "975463943"}}
{"Amount": "2105", "Quantity": "2", "Id": "975463943", "Client_Store_sk": "1109"}
... etc for all your documents
'

Just make sure to replace your_index and your_type with the actual index and type names you're using.
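To check a request body against that format before sending it, something like the following may help. It is only a sketch: `check_bulk_body` is a hypothetical helper name, and it assumes the body contains nothing but `index` actions, each followed by its document line.

```python
import json
from typing import List


def check_bulk_body(body: str) -> List[str]:
    """Return a list of problems found in a _bulk body (empty if none were found)."""
    problems = []
    if not body.endswith("\n"):
        problems.append("body does not end with a newline")

    # Ignore blank lines, then expect strictly alternating action/document lines.
    lines = [line for line in body.split("\n") if line.strip()]
    if len(lines) % 2 != 0:
        problems.append("odd number of lines, expected action/document pairs")

    for number, line in enumerate(lines, start=1):
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"non-empty line {number} is not valid JSON")
            continue
        if not isinstance(parsed, dict):
            problems.append(f"non-empty line {number} is not a JSON object")
        elif number % 2 == 1 and "index" not in parsed:
            problems.append(f"non-empty line {number} is not an index action line")
    return problems
```

Run against the array-style file from the question, this reports the lines that fail to parse on their own, which is the reason the `_bulk` endpoint finds no requests in it.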

UPDATE

Note that the command line (the first line of each pair) can be shortened by removing _index and _type if those are specified in your URL. It is also possible to remove _id if you specify the path to your id field in your mapping (note that this feature will be deprecated in ES 2.0, though). At the very least, the command line can look like {"index":{}} for every document, but it is always mandatory because it specifies which kind of operation you want to perform (in this case, indexing the document).

UPDATE 2

curl -XPOST localhost:9200/index_local/my_doc_type/_bulk --data-binary  @/home/data1.json

/home/data1.json should look like this:

{"index":{}}
{"Amount": "480", "Quantity": "2", "Id": "975463711", "Client_Store_sk": "1109"}
{"index":{}}
{"Amount": "2105", "Quantity": "2", "Id": "975463943", "Client_Store_sk": "1109"}
{"index":{}}
{"Amount": "2107", "Quantity": "3", "Id": "974920111", "Client_Store_sk": "1109"}

UPDATE 3

You can refer to this answer to see how to generate the new json style file mentioned in UPDATE 2.
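As one possible way to produce such a file, here is a minimal Python sketch. The function name `array_to_bulk` and the output path in the usage note are assumptions made for illustration, and the whole array is loaded into memory, so it is only suitable for files that fit in RAM.

```python
import json


def array_to_bulk(source_path: str, target_path: str) -> int:
    """Rewrite a JSON array file as a _bulk body and return the document count."""
    with open(source_path, encoding="utf-8") as source:
        documents = json.load(source)  # expects a top-level JSON array of objects

    with open(target_path, "w", encoding="utf-8", newline="\n") as target:
        for document in documents:
            # One action line, then the document itself on a single line.
            target.write('{"index":{}}\n')
            # json.dumps without indent emits no raw newlines inside a document.
            target.write(json.dumps(document) + "\n")
    return len(documents)
```

For example, `array_to_bulk("/home/data1.json", "/home/data1.bulk.json")` would write the converted file next to the original (the second path is hypothetical), and the converted file is the one to pass to `--data-binary`.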

UPDATE 4

As of ES 7.x, the doc_type is not necessary anymore and should simply be _doc instead of my_doc_type. As of ES 8.x, the doc type will be removed completely. You can read more about this here

Val
  • I get the format that you have mentioned, but i wanted to ask if i can have a workaround so that I do not have to specify something like this {"index": {"_index": "your_index", "_type": "your_type", "_id": "975463711"}} after each document in json ? – Amit P Oct 26 '15 at 07:19
  • 2
    The command line is always mandatory for each document. If you add the index and type name in your URL (i.e. `localhost:9200/your_index/your_type/_bulk`), you can remove `_index` and `_type` from the command line to shorten it. There's also a way to not have to specify `_id` but at the very least, you'll always need to specify what operation you want to perform with the document, i.e. the shortest you can do is `{"index":{}}` – Val Oct 26 '15 at 07:24
  • So the request curl -XPOST localhost:9200/index_local/my_doc_type/_bulk --data-binary --data @/home/data1.json where i have specified the index and the doc_type in the request , my json data should look like this? [{"index":{}} {"Amount": "480", "Quantity": "2", "Id": "975463711", "Client_Store_sk": "1109"}, {"index":{}} {"Amount": "2105", "Quantity": "2", "Id": "975463943", "Client_Store_sk": "1109"}, {"index":{}} {"Amount": "2107", "Quantity": "3", "Id": "974920111", "Client_Store_sk": "1109"} ] – Amit P Oct 26 '15 at 08:16
  • 1
    You don't need to specify your JSON objects inside an array (i.e. `[...]`) and no commas between documents, just one JSON per line with newline characters at the end of each line (don't forget a newline after the last line). I've updated my answer with your latest code. – Val Oct 26 '15 at 08:22
  • 1
    @Val Am i correct then in saying that you cannot simply pass in a .json object and that it needs to be parsed / transformed first (i.e. each item on it's own line and an added `index` header for each item? If so, is there a known tool that can be used to do this automatically? I ask cause I have a json file that contains 10 000 items, I assumed I would be able to pass the entire document in, however, I was quickly corrected. – Hexie Aug 09 '17 at 06:52
  • 1
    @Hexie in your case, you can use UPDATE 2 above and a shell script one-liner to update your file and add the header line. – Val Aug 09 '17 at 06:57
  • 1
    @Val Thanks for the quick response, but again that would mean manual manipulation right? Whereby I would need to update the json file to contain a new `header` row about each of the 10 000 entries? I can give you an example of my json data if you'd like but simply put would follow this: [ { "fileName": "filename", "data":"massive string text data here" } ] x 10 000 (PS - i say 10 000 but this will have ten's of millions, i'm using thousands for a POC i'm busy with) – Hexie Aug 09 '17 at 07:04
  • 1
    @Hexie I suggest you create a new question (possibly referencing this one) so other people might benefit from it. – Val Aug 09 '17 at 07:14
  • 2
    @Val Fair suggestion - new question created: https://stackoverflow.com/questions/45601344/elasticsearch-bulk-json-data – Hexie Aug 09 '17 at 22:05
  • doc type seems to be deprecated now. POST Body of update 2 is unchanged but new url (which i find much more logic) is localhost:9200/index_local/_bulk – Harry Sep 27 '21 at 09:33
  • @Harry Good point, that answer was written a while ago. I've added another update. Thanks – Val Sep 27 '21 at 09:39
  • Thats great! i just fear update 4 is not 100% correct as _doc is the doc_type and they want you now to completely toss it. E.g. when using /myindex/_doc/_bulk instead of /myindex/_bulk one gets a deprecation error in reply: #! Deprecation: [types removal] Specifying types in bulk requests is deprecated. – Harry Sep 27 '21 at 10:30
  • @Harry yes, but deprecation doesn't mean error, it's just not necessary to specify it, but not an error to do so. `_doc` is going to go away in ES 8 which is due in a few months from now. I've added some more context. – Val Sep 27 '21 at 11:06
  • @Val can we add alias too in the index line? – Scorpy Feb 07 '23 at 13:43
  • @Scorpy only if the alias points to a single index underneath – Val Feb 07 '23 at 13:49
  • Yes the bulk is for the same index but when I add alias to the index line command it says unknown parameter "_alias" – Scorpy Feb 07 '23 at 14:35
  • @Scorpy you need to call it `_index` not `_alias` – Val Feb 07 '23 at 14:48
  • Oh no sorry i just thought the way we pass "_index" "_type" we could pass "_alias". – Scorpy Feb 07 '23 at 15:20
15

As of today, 6.1.2 is the latest version of Elasticsearch, and the curl command that works for me on Windows (x64) is

curl -s -XPOST localhost:9200/my_index/my_index_type/_bulk -H "Content-Type: application/x-ndjson" --data-binary @D:\data\mydata.json

The format of the data in mydata.json remains the same as shown in @Val's answer.
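If you would rather send the request from a script than from curl, the same header can be set with Python's standard library. This is a sketch only: `build_bulk_request` and `send_bulk_request` are hypothetical helper names, and the URL in the usage note is the one from the curl command above.

```python
import json
import urllib.request


def build_bulk_request(url: str, body: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to a _bulk endpoint."""
    if not body.endswith("\n"):
        body += "\n"  # the bulk body has to end with a newline
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )


def send_bulk_request(request: urllib.request.Request) -> dict:
    """Send the prepared request and return the parsed JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `send_bulk_request(build_bulk_request("http://localhost:9200/my_index/my_index_type/_bulk", body))` needs a running node. The top-level `errors` field of the response indicates whether any item in the batch failed.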

Thomas
  • A side note for `Content-Type` that we should not include `charset` in it as a bug will return HTTP 406 https://github.com/elastic/elasticsearch/issues/28123 – lk_vc Jun 17 '19 at 10:36
3

A valid Elasticsearch bulk API request would be something like (ending with a newline):

POST http://localhost:9200/products_slo_development_temp_2/productModel/_bulk

{ "index":{ } } 
{"RequestedCountry":"slo","Id":1860,"Title":"Stol"} 
{ "index":{ } } 
{"RequestedCountry":"slo","Id":1860,"Title":"Miza"} 

Elasticsearch bulk api documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html

This is how I do it

I send a POST HTTP request with the uri variable as the URL of the request and the elasticsearchJson variable as the body, formatted for the Elasticsearch bulk API:

var uri = @"/" + indexName + "/productModel/_bulk";
var json = JsonConvert.SerializeObject(sqlResult);
var elasticsearchJson = GetElasticsearchBulkJsonFromJson(json, "RequestedCountry");

Helper method for generating the required json format for the Elasticsearch bulk api:

public string GetElasticsearchBulkJsonFromJson(string jsonStringWithArrayOfObjects, string firstParameterNameOfObjectInJsonStringArrayOfObjects)
{
  return @"{ ""index"":{ } } 
" + jsonStringWithArrayOfObjects.Substring(1, jsonStringWithArrayOfObjects.Length - 2).Replace(@",{""" + firstParameterNameOfObjectInJsonStringArrayOfObjects + @"""", @" 
{ ""index"":{ } } 
{""" + firstParameterNameOfObjectInJsonStringArrayOfObjects + @"""") + @"
";
}

The first property/field in my JSON object is the RequestedCountry property, which is why I use it in this example.

productModel is my Elasticsearch document type. sqlResult is a C# generic list with products.

tedi
2

This answer is for Elasticsearch 7.x onwards, where _type is deprecated. As others have mentioned, you can read the file programmatically and construct a request body as described below. Also, I see that each of your JSON objects has the Id attribute, so you could set the document's internal id (_id) to be the same as this attribute. The updated _bulk API request would look like this:

HTTP Method: POST

URI: /<index_name>/_bulk

Request body (should end with a new line):

{"index":{"_id": "975463711"}}
{"Amount": "480", "Quantity": "2", "Id": "975463711", "Client_Store_sk": "1109"}
{"index":{"_id": "975463943"}}
{"Amount": "2105", "Quantity": "2", "Id": "975463943", "Client_Store_sk": "1109"}
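Constructing that body programmatically could look roughly like this in Python. Treat it as a sketch: `bulk_body_with_ids` is a hypothetical name, and it assumes every document carries the `Id` field shown in the question.

```python
import json


def bulk_body_with_ids(documents, id_field="Id"):
    """Build a _bulk body whose action lines reuse each document's own id."""
    parts = []
    for document in documents:
        # Action line: index this document under the id taken from id_field.
        parts.append(json.dumps({"index": {"_id": document[id_field]}}) + "\n")
        # Source line: the document itself, serialized on a single line.
        parts.append(json.dumps(document) + "\n")
    return "".join(parts)
```

Because every line is written with its own trailing newline, the body ends with a newline as required. A document without the id field raises `KeyError` rather than being indexed under a generated id.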
Binita Bharati