0

Trying to dump the _id column only. With mongo shell it finishes in around 2 minutes:

time mongoexport -h localhost -d db1 -c collec1 -f _id -o u.text --csv
connected to: localhost
exported 68675826 records
real    2m20.970s

With java it takes about 30 minutes: java -cp mongo-test-assembly-0.1.jar com.poshmark.Test

class Test {
public static void main(String[] args) {


    MongoClient mongoClient = new MongoClient("localhost");
    MongoDatabase database = mongoClient.getDatabase("db1");
    MongoCollection<Document> collection = database.getCollection("collec1");

    MongoCursor<Document> iterator = collection.find().projection(new Document("_id", 1)).iterator();
    while (iterator.hasNext()) {
        System.out.println(iterator.next().toString());
    }
}

}

CPU usage on box is low, don't see any network latency issues, since both tests are running from same box

Update:

Used Files.newBufferedWriter instead of System.out.println but ended up with same performance.

Looked at db.currentOp(), makes me think that mongo is hitting disk since it is having too many numYields

{
"inprog" : [
    {
        "desc" : "conn8636699",
        "threadId" : "0x79a70c0",
        "connectionId" : 8636699,
        "opid" : 1625079940,
        "active" : true,
        "secs_running" : 12,
        "microsecs_running" : NumberLong(12008522),
        "op" : "getmore",
        "ns" : "users.users",
        "query" : {
            "_id" : {
                "$exists" : true
            }
        },
        "client" : "10.1.166.219:60324",
        "numYields" : 10848,
        "locks" : {

        },
        "waitingForLock" : false,
        "lockStats" : {
            "Global" : {
                "acquireCount" : {
                    "r" : NumberLong(21696)
                },
                "acquireWaitCount" : {
                    "r" : NumberLong(26)
                },
                "timeAcquiringMicros" : {
                    "r" : NumberLong(28783)
                }
            },
            "MMAPV1Journal" : {
                "acquireCount" : {
                    "r" : NumberLong(10848)
                },
                "acquireWaitCount" : {
                    "r" : NumberLong(5)
                },
                "timeAcquiringMicros" : {
                    "r" : NumberLong(40870)
                }
            },
            "Database" : {
                "acquireCount" : {
                    "r" : NumberLong(10848)
                }
            },
            "Collection" : {
                "acquireCount" : {
                    "R" : NumberLong(10848)
                }
            }
        }
    }
  ]
}
Gaurav Shah
  • 5,223
  • 7
  • 43
  • 71
  • Operations by `mongoexport` and Java application are not same, I think. One is a command line utility tool and the other an application. If you run a query from `mongo` shell for the same Java code (as you have posted) - it _may be_ comparable. – prasad_ Mar 28 '20 at 05:22
  • Also, note that `mongoexport` and Java applications have different purposes. The export utility has a very limited and specialized functionality, _while_ Java programs cover a vast range of functions. – prasad_ Mar 28 '20 at 05:49
  • @prasad_ why do you think both are different ? one is a C application other is Java. Both of them run a query on Mongo server and get back the results. I understand the purpose are different but speed should not be vastly different. Even with underlying C optimizations it should not be very far from Java – Gaurav Shah Mar 28 '20 at 06:30
  • Your Java code doesn't have the functionality of the export program you have posted. It is not clear what you are trying to accomplish. – prasad_ Mar 28 '20 at 08:49
  • `System.out.println` has bad performance, try to write into file with Java and compare the results – Valijon Mar 28 '20 at 10:45
  • @prasad_ I am trying to dump the _id column. In both the approaches I am trying to do that. – Gaurav Shah Mar 29 '20 at 02:31

1 Answers1

2

The problem resides in STDOUT.

Printing to stdout is not inherently slow. It is the terminal you work with that is slow.

https://stackoverflow.com/a/3860319/3710490

The disk appears to be faster, because it is highly buffered.

The terminal, on the other hand, does little or no buffering: each individual print / write(line) waits for the full write (i.e. display to output device) to complete.

https://stackoverflow.com/a/3857543/3710490

I've reproduced your use case with enough similar dataset.

mongoexport to FILE

$ time "C:\Program Files\MongoDB\Server\4.2\bin\mongoexport.exe" -h localhost -d test -c collec1 -f _id -o u.text --csv
2020-03-28T13:03:01.550+0100    csv flag is deprecated; please use --type=csv instead
2020-03-28T13:03:02.433+0100    connected to: mongodb://localhost/
2020-03-28T13:03:03.479+0100    [........................]  test.collec1  0/21028330  (0.0%)
2020-03-28T13:05:02.934+0100    [########################]  test.collec1  21028330/21028330  (100.0%)
2020-03-28T13:05:02.934+0100    exported 21028330 records

real    2m1,936s
user    0m0,000s
sys     0m0,000s

mongoexport to STDOUT

$ time "C:\Program Files\MongoDB\Server\4.2\bin\mongoexport.exe" -h localhost -d test -c collec1 -f _id --csv
2020-03-28T14:43:16.479+0100    connected to: mongodb://localhost/
2020-03-28T14:43:16.545+0100    [........................]  test.collec1  0/21028330  (0.0%)
2020-03-28T14:53:02.361+0100    [########################]  test.collec1  21028330/21028330  (100.0%)
2020-03-28T14:53:02.361+0100    exported 21028330 records

real    9m45,962s
user    0m0,015s
sys     0m0,000s

JAVA to FILE

$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar FILE
Wasted time for [FILE] -  271,57 sec

real    4m32,174s
user    0m0,015s
sys     0m0,000s

JAVA to STDOUT to FILE

$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar SYSOUT > u.text

real    6m50,962s
user    0m0,015s
sys     0m0,000s

JAVA to STDOUT

$ time "C:\Program Files\Java\jdk1.8.0_211\bin\java.exe" -jar mongo-test-assembly-0.1.jar SYSOUT > u.text

Wasted time for [SYSOUT] -  709,33 sec

real    11m51,276s
user    0m0,000s
sys     0m0,015s

Java code

long init = System.currentTimeMillis();
try (MongoClient mongoClient = new MongoClient("localhost");
    BufferedWriter writer = Files.newBufferedWriter(Files.createTempFile("benchmarking", ".tmp"))) {

    MongoDatabase database = mongoClient.getDatabase("test");
    MongoCollection<Document> collection = database.getCollection("collec1");

    MongoCursor<Document> iterator = collection.find().projection(new Document("_id", 1)).iterator();
    while (iterator.hasNext()) {
        if ("SYSOUT".equals(args[0])) {
            System.out.println(iterator.next().get("_id"));

        } else {
            writer.write(iterator.next().get("_id") + "\n");
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}
long end = System.currentTimeMillis();
System.out.println(String.format("Wasted time for [%s] -  %.2f sec", args[0], (end - init) / 1_000.));
Valijon
  • 12,667
  • 4
  • 34
  • 67
  • thank you. I did try printWriter, but that was slow as well. will give it a shot on newBufferedWriter. Also can you tell the mongo java driver version & mongo version that you are using ? – Gaurav Shah Mar 29 '20 at 02:31
  • Even with newBufferedWriter it takes the same amount of time. Goes till 30 mins. I think I have too many numYields and makes me think that mongo is somehow hitting disk than memory to dump _id field – Gaurav Shah Mar 29 '20 at 02:42
  • I think the issue lies in underlying hint approach: https://github.com/mongodb/mongo-tools/blob/master/mongoexport/mongoexport.go#L351 Mongo export adds a default _id hint on no query. will give it a shot – Gaurav Shah Mar 30 '20 at 03:19