I have a MongoDB 3.6.2.0 collection with 130M documents. It has several simple fields and 2 fields with nested JSON documents. The data is stored compressed (zlib).
I need to export one of the embedded fields to JSON as quickly as possible. However, mongoexport is taking forever: after 12 hours of running it has processed only 5.5% of the data, which is too slow for me.
The CPU is not busy. Mongoexport seems to be single-threaded.
The export command I am using:
mongoexport -c places --fields API \
--uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
-o D:\APIRecords.json
Under the hood, it is actually the getMore command that is unreasonably slow:
2018-05-02T17:59:35.605-0700 I COMMAND [conn289] command maps.places command: getMore { getMore: 14338659261, collection: "places", $db: "maps" } originatingCommand: { find: "places", filter: {}, sort: {}, projection: { _id: 1, API: 1 }, skip: 0, snapshot: true, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:14338659261 keysExamined:0 docsExamined:5369 numYields:1337 nreturned:5369 reslen:16773797 locks:{ Global: { acquireCount: { r: 2676 } }, Database: { acquireCount: { r: 1338 } }, Collection: { acquireCount: { r: 1338 } } } protocol:op_query 22796ms
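To put that getMore in perspective, here is a quick bit of Python arithmetic over the numbers in that log line (nreturned, reslen and the 22796 ms duration). It is only a back-of-the-envelope projection, but it shows why 12 hours covered just a few percent of the collection:

# Numbers taken from the getMore log line and the collection stats below.
nreturned = 5_369              # documents returned by this getMore
reslen = 16_773_797            # bytes returned by this getMore
duration_s = 22.796            # 22796 ms
total_docs = 130_084_054       # collection count

print(f"returned data rate : {reslen / duration_s / 1e6:.2f} MB/s")                      # ~0.74 MB/s
print(f"projected duration : {total_docs / nreturned * duration_s / 86_400:.1f} days")   # ~6.4 days for the whole export

That is roughly in line with the 5.5% I actually observed after 12 hours.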
I have tried running multiple commands with --skip and --limit options in separate processes, like this:
mongoexport -c places --skip 10000000 --limit 10000000 --fields API \
--uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
-o D:\APIRecords_1.json

mongoexport -c places --skip 20000000 --limit 10000000 --fields API \
--uri mongodb://user:pass@hostIP:hostPort/maps?authSource=admin \
-o D:\APIRecords_2.json
etc. But I was not able to wait long enough for the command with the first non-zero --skip to even start producing output!
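As far as I understand, with no usable index the server satisfies a skip by collection-scanning and discarding the skipped documents, so nothing is returned until the scan has gone past the skip point. Here is a minimal pymongo sketch (same placeholder credentials/host as in the commands above; the skip value is arbitrary) for timing how long the first document after a large skip takes to arrive:

import time
from pymongo import MongoClient

# Same placeholder connection details as in the mongoexport commands above.
client = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")
coll = client["maps"]["places"]

start = time.time()
cursor = coll.find({}, {"_id": 1, "API": 1}).skip(10_000_000).limit(1)
doc = next(cursor, None)   # blocks until the server has scanned and discarded the 10M skipped documents
print(f"first document after the skip arrived after {time.time() - start:.0f}s")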
I have also tried the --forceTableScan option, which did not make any difference.
I have no indexes on the places collection.
My storage configuration:
journal.enabled: false
wiredTiger.collectionConfig.blockCompressor: zlib
Collection stats:
'ns': 'maps.places',
'size': 2360965435671,
'count': 130084054,
'avgObjSize': 18149,
'storageSize': 585095348224.0
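For what it's worth, those stats imply roughly a 4x zlib compression ratio, i.e. every byte read from disk turns into about four bytes of BSON that have to be decompressed and iterated (quick Python check on the numbers above):

size = 2_360_965_435_671           # uncompressed BSON size from the stats
storage_size = 585_095_348_224     # zlib-compressed size on disk
count = 130_084_054

print(f"compression ratio   : {size / storage_size:.1f}x")               # ~4.0x
print(f"avg document (BSON) : {size / count / 1024:.1f} KiB")            # ~17.7 KiB
print(f"avg document (disk) : {storage_size / count / 1024:.1f} KiB")    # ~4.4 KiB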
My server specs:
Windows Server 2012 R2 x64
10 GB RAM, 4 TB HDD, 6-core Xeon @ 2.2 GHz
I've run a test, and with an SSD the read throughput is just as terrible as with the HDD.
My question:
Why is reading so slow? Has anyone else experienced the same issue? Can you give me any hints on how to speed up data dumping?
Update
I moved the DB to fast NVMe SSD drives, and I think I can now state my concerns about MongoDB read performance more clearly.
Why does this command, which looks for a chunk of documents that lack one specific field (HTML) but have another (API.url):
2018-05-05T07:20:46.215+0000 I COMMAND [conn704] command maps.places command: find { find: "places", filter: { HTML: { $exists: false }, API.url: { $exists: true } }, skip: 9990, limit: 1600, lsid: { id: UUID("ddb8b02c-6481-45b9-9f84-cbafa586dbbf") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: COLLSCAN cursorid:15881327065 keysExamined:0 docsExamined:482851 numYields:10857 nreturned:101 reslen:322532 locks:{ Global: { acquireCount: { r: 21716 } }, Database: { acquireCount: { r: 10858 } }, Collection: { acquireCount: { r: 10858 } } } protocol:op_query 177040ms
only generate about 50 MB/s of read pressure on a fast flash drive? This is clearly the performance of a single-threaded, scattered read, whereas I have already verified that the drive easily sustains 1 GB/s of sequential read/write throughput.
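Working backwards from that log line and the collection stats (docsExamined × avgObjSize / duration) gives the same order of magnitude, roughly 50 MB/s of uncompressed BSON scanned, and with the ~4x zlib compression only a fraction of that even needs to come off the disk:

docs_examined = 482_851                               # from the log line above
avg_obj_size = 18_149                                 # bytes, from the collection stats
duration_s = 177.040                                  # 177040 ms from the log line
compression = 2_360_965_435_671 / 585_095_348_224     # ~4x, from the collection stats

scanned = docs_examined * avg_obj_size
print(f"uncompressed scan rate : {scanned / duration_s / 1e6:.0f} MB/s")                 # ~49 MB/s
print(f"implied on-disk rate   : {scanned / compression / duration_s / 1e6:.0f} MB/s")   # ~12 MB/s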
In terms of MongoDB internals, would it not be wiser to read the on-disk data sequentially, instead of iterating BSON document after document, and gain a ~20x improvement in scanning speed? (And, since my blocks are zlib-compressed and the server has 16 cores, to decompress the fetched blocks in one or several helper threads?)
I can also confirm that even when I specify no query filter at all, and clearly want to iterate the ENTIRE collection, no fast sequential read of the underlying data file takes place.
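For reference, a minimal pymongo loop (same placeholder connection details as above; the projection mirrors the export) that makes the per-document pace of such an unfiltered scan easy to watch:

import time
from pymongo import MongoClient

# Same placeholder connection details as in the mongoexport commands above.
client = MongoClient("mongodb://user:pass@hostIP:hostPort/?authSource=admin")
coll = client["maps"]["places"]

docs = 0
start = time.time()
# No filter at all -- a full collection scan, projecting only the field I want to export.
for doc in coll.find({}, {"_id": 1, "API": 1}):
    docs += 1
    if docs % 100_000 == 0:
        elapsed = time.time() - start
        print(f"{docs:,} docs in {elapsed:,.0f}s ({docs / elapsed:,.0f} docs/s)")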