
I have to insert 10 million documents into a CouchDB database on localhost. I used a Python script to create random data in this format:

{
"docs": [
  {"_id": "0", "integer": 0, "string": "0"},
  {"_id": "1", "integer": 1, "string": "1"},
  {"_id": "2", "integer": 2, "string": "2"}
  ]
}

The file is 1.5 GB in size, as each document has 10 key-value pairs.
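For reference, a simplified sketch of the kind of generator script used (the real documents have 10 key-value pairs each, not the 3 shown; writing the documents one at a time keeps the whole 1.5 GB out of memory):

import json

# Simplified sketch of the generator: streams documents one at a time in the
# {"docs": [...]} shape so the full file never has to sit in memory.
# The real documents contain more fields than shown here.
NUM_DOCS = 10000000

with open("db.json", "w") as out:
    out.write('{"docs": [\n')
    for i in range(NUM_DOCS):
        doc = {"_id": str(i), "integer": i, "string": str(i)}
        out.write(json.dumps(doc))
        out.write(",\n" if i < NUM_DOCS - 1 else "\n")
    out.write("]}\n")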

I am using this command to load the JSON file:

curl -d @db.json -H "Content-type: application/json" -X POST http://127.0.0.1:5984/new/_bulk_docs

For 100,000 documents it took 10-15 seconds to load, but for 10,000,000 it didn't even finish loading after 12 hours.

Any help on how I can bulk insert into CouchDB would be appreciated.

TIA

knowrahulj

2 Answers


Finally, I divided my file into 100 files, each containing 0.1 million records, and uploaded them to the database with this command:

FOR /L %i IN (0,1,99) DO (
    curl -d @dbn%i.json -H "Content-type: application/json" -X POST http://127.0.0.1:5984/new4/_bulk_docs
)
knowrahulj

I'm not familiar with the CouchDB bulk API, but you mentioned that a bulk request with 100,000 records worked, so I suspect 10,000,000 is just too much for a single go.

Consider splitting your big file of 10,000,000 records into smaller JSON files of 100,000 records each and posting each chunk/batch with a separate request:

import json

# Batch function from: https://stackoverflow.com/a/8290508/7663649
def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

BATCH_SIZE = 100000
with open("db.json") as input_file:
    # The input file wraps the documents in a "docs" array, so unwrap it first
    docs = json.load(input_file)["docs"]

for batch_index, batch_list in enumerate(batch(docs, BATCH_SIZE)):
    # Re-wrap each chunk in {"docs": [...]} so it can be posted to _bulk_docs as-is
    with open("chunk_{}.json".format(batch_index), "w") as chunk_file:
        json.dump({"docs": batch_list}, chunk_file)
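
Each chunk file can then be posted in a loop. A minimal sketch, assuming the requests library is available and the database name new from the question (a curl loop like the one in the other answer works just as well):

import glob
import json
import requests  # assumption: any HTTP client, or curl in a shell loop, will do

BULK_URL = "http://127.0.0.1:5984/new/_bulk_docs"  # database name taken from the question

for chunk_path in sorted(glob.glob("chunk_*.json")):
    with open(chunk_path) as chunk_file:
        payload = json.load(chunk_file)  # already shaped as {"docs": [...]}
    requests.post(BULK_URL, json=payload).raise_for_status()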
Pierre
  • Thanks for the answer, I already did it by splitting into 100 files of 100,000 records each. However, I was looking for something faster, like bulk insert in MySQL – knowrahulj Jun 03 '18 at 08:58
  • The only way to mimic MySQL-style bulk insert would be to delete all your views (to disable index updating), do your inserts, then re-create the views. – Jonathan Hall Jun 06 '18 at 16:37