9

I have a MongoDB 3.6.2.0 collection with 130M documents. It has several simple fields and 2 fields with nested JSON documents. Data is stored in compressed format (zlib).

I need to export one of the embedded fields to JSON as soon as possible. However, mongoexport is taking forever: after 12 hours of running it has processed only 5.5% of the data, which is too slow for me.

The CPU is not busy. Mongoexport seems to be single-threaded.

The export command I am using:

mongoexport -c places --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords.json

Under the hood it is actually the getMore command that is unreasonably slow:

2018-05-02T17:59:35.605-0700 I COMMAND  [conn289] command maps.places command: getMore { getMore: 14338659261, collection: "places", $db: "maps" } originatingCommand: { find: "places", filter: {}, sort: {}, projection: { _id: 1, API: 1 }, skip: 0, snapshot: true, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:14338659261 keysExamined:0 docsExamined:5369 numYields:1337 nreturned:5369 reslen:16773797 locks:{ Global: { acquireCount: { r: 2676 } }, Database: { acquireCount: { r: 1338 } }, Collection: { acquireCount: { r: 1338 } } } protocol:op_query 22796ms

I have tried running multiple commands with --skip and --limit options in separate processes, like this:

mongoexport -c places --skip 10000000 --limit 10000000 --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords1.json
mongoexport -c places --skip 20000000 --limit 10000000 --fields API \
    --uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
    -o D:\APIRecords2.json

etc. But I gave up waiting before the command with the first non-zero skip even started producing output!

I have also tried the --forceTableScan option, which did not make any difference.

I have no indexes on the places collection.

My storage configuration:

journal.enabled: false
wiredTiger.collectionConfig.blockCompressor: zlib

Collection stats:

'ns': 'maps.places',
'size': 2360965435671,
'count': 130084054,
'avgObjSize': 18149,
'storageSize': 585095348224.0

My server specs:

Windows Server 2012 R2 x64
10 GB RAM, 4 TB HDD, 6-core Xeon 2.2 GHz

I've run a test, and with an SSD it shows the same terrible read throughput as with the HDD.

My question:

Why is reading so slow? Has anyone else experienced the same issue? Can you give me any hints on how to speed up data dumping?

Update

I moved the DB to fast NVMe SSD drives, and I think now I can state my concerns about MongoDB read performance more clearly.

Why does this command, which looks for a chunk of documents that do not have a specific field:

2018-05-05T07:20:46.215+0000 I COMMAND  [conn704] command maps.places command: find { find: "places", filter: { HTML: { $exists: false }, API.url: { $exists: true } }, skip: 9990, limit: 1600, lsid: { id: UUID("ddb8b02c-6481-45b9-9f84-cbafa586dbbf") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:15881327065 keysExamined:0 docsExamined:482851 numYields:10857 nreturned:101 reslen:322532 locks:{ Global: { acquireCount: { r: 21716 } }, Database: { acquireCount: { r: 10858 } }, Collection: { acquireCount: { r: 10858 } } } protocol:op_query 177040ms

only generate about 50 MB/s of read pressure on a fast flash drive? This is clearly the performance of a single-threaded random (scattered) read, whereas I have just verified that the drive easily sustains 1 GB/s of read/write throughput.

In terms of Mongo internals, would it not be wiser to read the BSON file sequentially and gain a 20x scanning speed improvement, instead of iterating document after document? (And, since my blocks are zlib-compressed and the server has 16 cores, wouldn't it be better to decompress fetched chunks in one or several helper threads?)

I can also confirm that even when I specify no query filters and clearly want to iterate the ENTIRE collection, a fast sequential read of the BSON file does not happen.

  • I use Mongo, but I don't have any knowledge of exporting to be able to help. However, something to try: if you have found that the export binary is single-threaded, could you kick off several parallel exports, with each one specifying a different query? I don't know if that would result in disk thrashing that would make your export speed worse, or whether the operation is sufficiently single-core-bound that it would help. – halfer May 13 '18 at 21:41
  • Thanks for your comment halfer, yes I tried running parallel exports each skipping a different number of records (10M, 20M, etc), but it turned out Mongo can't skip records without actually crawling them one by one (correct me if I am wrong), which resulted in only the 1st export process being active and the rest 'hanging' ( – Anatoly Alekseev May 14 '18 at 10:53
  • I think index info is not relevant here as I want full collection scan, not a scan of some subset of the collection... – Anatoly Alekseev May 14 '18 at 13:01
  • 1
    Rather than skips, can you construct a query on a string or numeric field to divide the data into ranges? For example, if there is string data in `places`, then items starting with `A` first, then `B`, etc. They might be both parallelisable and benefit from indexing. – halfer May 14 '18 at 13:34
  • Oh. That's true, but to build such an index (i don't have any) it would require to do full collection scan anyway first (internally). Thanks for the suggestion. – Anatoly Alekseev May 14 '18 at 17:13
  • If you just query this data, how fast can you get it? This will help you decide if `mongoexport` or bad query plan is at fault. – Imaskar May 15 '18 at 09:12
  • Imaskar, it seems that the problem is generally in the way the compressed collection is being read/decompressed by mongo. – Anatoly Alekseev May 15 '18 at 09:19
  • Your MongoDB instance seems to be heavily used in parallel which is why you have a high `numYields` in your log. Try reducing the load to speed up that bugger. Also, could there be a slow network connection in between your client and your MongoDB server? – dnickless May 16 '18 at 21:13
  • Also, what you could try is to use `mongodump` followed by a `bsondump`. Not sure if that helps but it would probably give you some more insight into what is really the bottleneck. – dnickless May 16 '18 at 21:19
  • @AnatolyAlekseev is the disk saturated? Is there any bottleneck that you can see? Is the database being used while this export operation is ongoing? – kevinadi May 17 '18 at 06:59
  • No, not used, and the disks were far from saturated if we speak about sequential reads. But as said, disk utilization was 50 MB/s, which is close to the random-read performance of the disk. However, I don't know what the rationale is for traversing the collection in some (random?) order; my query did not contain any ordering clause. – Anatoly Alekseev May 17 '18 at 07:19
  • @AnatolyAlekseev it is possible that the data is not physically laid out sequentially on the disk, thus leading to random reads. If the collection is active, then it's very likely it's not sequential physically. Having said that, random access comprised the vast majority of database use cases. It is extremely rare that a database would perform a strictly linear operation. – kevinadi May 17 '18 at 07:42
  • Look, when exporting, I need full collection scan, ie accessing every document in collection regardless of documents order. I don't see why Mongodb can't read bson file blocks sequentially at full disk speed while decoding documents present in each next block. It would be so natural thing to do... – Anatoly Alekseev May 17 '18 at 14:16
  • @Anatoly, I have exactly the same problem/behavior with mongodump. Have you found some way to improve the speed? Mongodump also doesn't read the collection sequentially. – HPCS Aug 06 '19 at 13:48
  • @HPCS No, I have not, unfortunately. – Anatoly Alekseev Aug 07 '19 at 17:20

5 Answers

6

There are many factors that are limiting the export performance.

  • The data size is relatively large compared to available memory: ~2 TB vs. ~5 GB WiredTiger cache (if set to default). That is:
    • The whole WiredTiger cache can only contain at best ~0.22% of the collection, in reality it's very likely much less than this since the cache would contain data from other collections and indexes.
    • This means that WiredTiger needs to fetch from disk very frequently, while evicting the current content of the cache. If the replica set is being actively used, this would mean evicting "dirty" data from the cache and persisting them to disk, which would take time.
    • Note that documents inside the WiredTiger cache are not compressed.
  • The collection contains large documents, of which you need only one part. This means that extra time is required to process the documents.
  • The collection is compressed with zlib, meaning that extra time must be used to uncompress the documents.
  • The readPreference is secondaryPreferred, meaning that it will try to read from a secondary. If the replica set is being actively written to, oplog apply operations on the secondary will block readers. This will add further delay.

If this is an operation you do frequently, one possible improvement is to create an index on the relevant fields and export using a covered query, which could improve performance since the index would be smaller than the full documents.
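
As a rough, untested sketch of that idea with PyMongo (the connection string mirrors the question; whether an index whose key is a whole embedded document such as API stays within the index key size limit and can actually cover the projection depends on your data, so treat this as illustrative only):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")
places = client["maps"]["places"]

# One-off: index the exported field together with _id. Building the index
# still requires one full collection scan, but later exports can reuse it.
places.create_index([("API", ASCENDING), ("_id", ASCENDING)])

# Project only indexed fields and hint the index, so the scan walks the
# (smaller) index entries instead of fetching full ~18 KB documents.
cursor = places.find({}, {"_id": 1, "API": 1}).hint([("API", ASCENDING), ("_id", ASCENDING)])
for doc in cursor:
    pass  # write each document out here, e.g. one JSON line per document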

Edit: Running mongoexport in parallel may be helpful in this case:

Following on from the additional information provided, I ran a test that seems to alleviate this issue somewhat.

It seems that running mongoexport in parallel, with each mongoexport handling a subset of the collection, might be able to speed up the export.

To do this, divide the _id namespace according to the number of mongoexport processes you plan to run.

For example, if I have 200,000 documents, starting with _id:0 to _id:199,999 and using 2 mongoexport processes:

mongoexport -q '{"_id":{"$gte":0, "$lt":100000}}' -d test -c test > out1.json &
mongoexport -q '{"_id":{"$gte":100000, "$lt":200000}}' -d test -c test > out2.json &

where in the above example, the two mongoexport processes are each handling one half of the collection.
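
If you prefer to script this rather than typing each command, a minimal sketch of the same idea is below (the database/collection name and the integer _id ranges are placeholders matching the example above, not your real data):

import subprocess

RANGES = [(0, 100000), (100000, 200000)]  # one (low, high) _id pair per process

procs = []
for i, (lo, hi) in enumerate(RANGES):
    query = '{"_id": {"$gte": %d, "$lt": %d}}' % (lo, hi)
    # Launch one mongoexport per _id range; they run concurrently.
    procs.append(subprocess.Popen(
        ["mongoexport", "-d", "test", "-c", "test",
         "-q", query, "-o", "out%d.json" % i]))

for p in procs:
    p.wait()  # block until every export slice has finished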

Testing this workflow with 1 process, 2 processes, 4 processes, and 8 processes I arrive at the following timings:

Using 1 process:

real    0m32.720s
user    0m33.900s
sys 0m0.540s

2 processes:

real    0m16.528s
user    0m17.068s
sys 0m0.300s

4 processes:

real    0m8.441s
user    0m8.644s
sys 0m0.140s

8 processes:

real    0m5.069s
user    0m4.520s
sys 0m0.364s

Depending on the available resources, running 8 mongoexport processes in parallel seems to speed up the process by a factor of ~6. This was tested on a machine with 8 cores.

Note: halfer's answer is similar in idea, although this answer basically tries to see if there's any benefit in calling mongoexport in parallel.

kevinadi
  • Thanks for suggestions, cache size should not matter for this task 'cause every document needs to be accessed only once. Yes collection was compressed but only 1 core was busy decompressing it, total CPU load was ~5%. There were no other operations during export. – Anatoly Alekseev May 17 '18 at 14:21
  • 1
    It does matter. WiredTiger loads the data from disk, uncompresses it, and loads it into its cache. "Cache" is the WiredTiger term for its working memory. The slow export you're seeing is symptomatic of an underprovisioned hardware. – kevinadi May 17 '18 at 22:14
  • Kevin, the average document size is only 16Kb. The server had 60Gb RAM, 50% of it devoted to the WiredTiger cache. But even 5Gb RAM would be more than enough to read hundreds of thousands of 16Kb documents from disk, process them and overwrite them in the cache IF MongoDB had some efficient internal way to do it, but it does not. – Anatoly Alekseev May 17 '18 at 22:21
  • 10GB or 60GB? I'm confused, you said 10GB in your question. – kevinadi May 17 '18 at 22:26
  • Yes old server is 10Gb one, I also tried second more powerful server with NVME SSD disks and 16 cores /60 Gb RAM – Anatoly Alekseev May 17 '18 at 22:49
  • Do you see the same performance issue using the new server? Are the timings comparable? – kevinadi May 17 '18 at 23:03
  • Yes the absolutely same case: on both servers disk read pressure is the one of random reads (~1.5Mb/s and ~50Mb/s respectively) – Anatoly Alekseev May 17 '18 at 23:46
  • I believe I have the beginnings of a solution. Please let me know if it works on your hardware. – kevinadi May 18 '18 at 06:10
  • @halfer yes you're right. My answer is just confirming that it can be done and how much benefit can be achieved. Credit is due, so it's yours :) Having said that it's unlikely that this will be done by MongoDB due to two reasons: 1) parallelization of export requires you to know the range & distribution of `_id`, which is not likely known by the database itself (MongoDB does not currently keep statistics of every field, even `_id`, since it's customizable). 2) exporting is not a typical database operation, so optimizing it would yield marginal benefit. – kevinadi May 18 '18 at 10:31
  • Guys, I'm sorry for the response delay. I should have mentioned that my _id field is a custom alphanumeric string of length ~30 (appears to be random). I can't divide it like that using the $gte/$lt approach. I think I can only use the skip/limit approach, but with chunks of size 10M it's not working, as described. – Anatoly Alekseev May 20 '18 at 15:29
  • 1
    @AnatolyAlekseev: you should still be able to run comparisons on it. Assuming it's ordinary Latin characters, take a look at the distribution of strings beginning with A, B, C, etc. You can still use gt/lt on strings (I don't know about Mongo specifically, but strings _are_ comparable generally). – halfer May 20 '18 at 15:50
  • 2
    @halfer good catch, fixed. Btw strings are comparable in MongoDB. – kevinadi May 20 '18 at 22:06
  • No, that won't work. There is no way to guarantee that I can get the data evenly split into a reasonable number of chunks, sorry. Is there any way to traverse the index itself, to identify its 'pivot points' so to say, and then apply your suggestion? Like, to find 16 points of the index which divide the whole _id range into more or less the same number of rows? – Anatoly Alekseev May 21 '18 at 05:58
  • Let me check if I can use some simple logic of dichotomy search and find().count() calls to find desired splitting points. – Anatoly Alekseev May 21 '18 at 06:06
  • But anyway, even if that is possible, the data was for sure not inserted in _id-sorted order. So even if we manage to find correct splitting points, the index entries returned for each separate query will point to documents scattered across the disk, therefore the hotspot will be random disk read access again, no? – Anatoly Alekseev May 21 '18 at 06:16
  • @AnatolyAlekseev: I don't think you're approaching this with the right frame of mind, and I say that on the off-chance it will be helpful. You're determined this not fixable and then not trying as a consequence. In my experience - and I don't claim to be able to fix all problems - problems are much more solvable if one is (a) determined and (b) excited to solve them. Excise the idea that it is Mongo's fault completely if you can. – halfer May 28 '18 at 11:44
  • @AnatolyAlekseev: there are several more things you could try. One thing that comes to mind is an ongoing writer process that categorises your data into chunks. It will periodically scan for documents, and if it is not marked, it will give it a category (e.g. 1-20). It will then also update a count of the category awarded, with a view to keeping all documents categorised, with a roughly equal number of documents in each category. It can periodically do counts across the whole database to correct its own (out of date) counts. – halfer May 28 '18 at 11:47
  • That approach would then give your documents simple filters that could be used in parallel exports. Of course, you could also use the ongoing writer system to build a mirror of your data instead, in the format you want to export it in. It might always lag behind a bit, but that's the nature of such systems. You have an enterprise-type challenge here, and the solution may have to be based on a set of engineering trade-offs. – halfer May 28 '18 at 11:50
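
Following up on the "pivot points" idea from this comment thread, here is a rough, untested PyMongo sketch: it walks only the _id index (an _id-only projection sorted on _id) and picks boundary values that split the collection into roughly equal ranges for the parallel $gte/$lt exports above. The skip calls still touch every index entry once in total, but index entries are tiny compared to ~18 KB documents. Names and the connection string mirror the question; everything else is illustrative.

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")
places = client["maps"]["places"]

def split_points(coll, parts):
    """Return parts-1 boundary _id values that split the collection roughly evenly."""
    total = coll.estimated_document_count()  # coll.count() on older PyMongo
    chunk = total // parts
    points = []
    for i in range(1, parts):
        # An _id-only projection sorted on _id should be served from the _id index.
        doc = next(coll.find({}, {"_id": 1}).sort("_id", ASCENDING).skip(i * chunk).limit(1))
        points.append(doc["_id"])
    return points

print(split_points(places, 16))  # 15 boundaries -> 16 $gte/$lt ranges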
3

You can try using the pandas and joblib libraries to export to the JSON file in parts. You can refer to this gist for processing the data in MongoDB.

from pandas import DataFrame
from joblib import Parallel, delayed

def process(idx, cursor):
    # Dump the documents behind one cursor into its own JSON file
    file_name = "fileName" + str(idx) + ".json"
    df = DataFrame(list(cursor))
    df.to_json(file_name, orient='records')

# mongo_collection is your pymongo Collection object.
# parallel_scan() returns a list of cursors; see the parallel_scan API of pymongo.
cursors = mongo_collection.parallel_scan(no_of_parts_of_collection)  # pass an integer
Parallel(n_jobs=4)(delayed(process)(idx, cursor) for idx, cursor in enumerate(cursors))

The n_jobs parameter spawns as many processes as the number specified, each occupying one core. I used 4 since your server has 6 cores available. The parallel_scan() API takes a number and divides the collection into that many parts. You can try higher numbers to break the collection into more evenly divided cursors.

I have tried a similar approach, although the signature and definition of my process function were different. I was able to process 2.5M records in under 20 mins. You can read this answer of mine to get an idea of what exactly I was trying to achieve.

Yayati Sule
  • That's very interesting Yayati I will give parallel_scan a try! – Anatoly Alekseev May 17 '18 at 14:25
  • 2
    except that parallel collection scan was only supported under MMAPV1 and this will give you one cursor with wiredTiger – Asya Kamsky May 19 '18 at 00:28
  • Indeed, Asya is right, for WiredTiger this won't work (at least in 3.6) according to https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/: This command will not return more than one cursor for the WiredTiger storage engine – Anatoly Alekseev May 21 '18 at 05:48
1

I don't work with Mongo, but a common trick could be used: write a simple app that efficiently and sequentially queries all the data, filters it, and saves it in the format you want.

If you need to save in a complex format and there are no libraries to work with it (I really doubt that), it might still be efficient to read everything, filter it, put the result back into a temporary collection, export that collection fully, and drop the temporary collection.
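
For what it's worth, a bare-bones sketch of such an app with PyMongo (the connection string, projection and output path are placeholders based on the question):

from bson import json_util
from pymongo import MongoClient

client = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")
places = client["maps"]["places"]

with open("APIRecords.json", "w", encoding="utf-8") as out:
    # A large batch_size reduces the number of getMore round trips.
    for doc in places.find({}, {"_id": 1, "API": 1}, batch_size=10000):
        out.write(json_util.dumps(doc) + "\n")  # one JSON document per line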

Imaskar
  • Does that mean I need to create my own BSON format parser? Thanks, not the best solution for me ) – Anatoly Alekseev May 15 '18 at 09:21
  • No, I mean query all rows from the collection outside. Like `db.places.find()` from any client library. – Imaskar May 15 '18 at 09:25
  • But that's what I am doing now, if you look at my last code block. This does not help, and the read pressure on the disk system is that of random reads, not sequential. I can't check whether it's the same when the collection is not compressed, though. – Anatoly Alekseev May 15 '18 at 13:10
  • In the code you provided you do filter on `--fields API`, and that's the case, I think. And better it be so, because the other possibility is that the query plan is right, but it decompresses the data in the same thread that reads it, and in that case I doubt you can do anything other than modify mongo code. – Imaskar May 15 '18 at 13:15
  • Ok, I read the last edit of the question. Sad story. It means, that mongo's sequential read is inefficient. – Imaskar May 15 '18 at 13:20
  • Exactly ( Yes, it looks like it reads and decompresses on the same thread, and does not allow skipping N records easily when N is big. Yes, I also thought the filter was a possible cause, but even without --fields it did not go better. – Anatoly Alekseev May 15 '18 at 13:28
0

mongoexport is a client tool that uses the public API and a socket connection to MongoDB itself, so it doesn't have access to the document BSON on disk.

Does this look like what you mentioned?

https://docs.mongodb.com/manual/core/backups/#back-up-by-copying-underlying-data-files

mongodump could also be an option for you.
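
A hedged sketch of that route, driven from Python for convenience (the URI mirrors the question; output paths are placeholders): mongodump writes the collection out as raw BSON, and bsondump then converts that BSON file to JSON offline. Note that mongodump has no field projection, so it dumps whole documents, not just the API field.

import subprocess

URI = "mongodb://user:pass@hostIP:hostPort/maps?authSource=admin"  # placeholder

# Dump the collection to dump/maps/places.bson
subprocess.run(["mongodump", "--uri", URI, "-c", "places", "-o", "dump"], check=True)

# bsondump writes JSON to stdout by default; redirect it into a file
with open("places.json", "w") as out:
    subprocess.run(["bsondump", "dump/maps/places.bson"], stdout=out, check=True)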

Andrii Muzalevskyi
0
import pymongo
import json
import time

myclient = pymongo.MongoClient("mongodb://localhost:27017/")
mydb = myclient["databasename"]
mycol = mydb["collectionname"]

c = mycol.find({}, {'_id': False})
start = time.time()

BATCH_SIZE = 10000
i = 0
while True:
    # Clone the base cursor and slice it: pymongo translates the slice into
    # skip/limit, so each iteration fetches the next BATCH_SIZE documents.
    c2 = c.clone()
    a = c2[(BATCH_SIZE * i):(BATCH_SIZE * (i + 1))]
    resolve_a = [x for x in a]

    # Write each batch to its own JSON file.
    with open(f"export_{i}.json", "w+") as f:
        json.dump(resolve_a, f)

    # A short final batch means the collection is exhausted.
    if len(resolve_a) < BATCH_SIZE:
        break
    i += 1
    total_done = BATCH_SIZE * i
    time_elapsed = (time.time() - start)
    efficiency = total_done / time_elapsed
    print(f"[{time_elapsed:.2f}s] completed {total_done} - {efficiency:.2f} jobs/s")
print("done")

The MongoDB default export takes about 3 seconds per 1,000 documents, while this code took 50 seconds for 100k, so it is roughly 6x faster for some very simple code. You can change BATCH_SIZE depending on how much RAM you have and how large you want each output file to be.

Kelvin Wang