
I have a 15GB file with more than 25 million rows in the following JSON format (which is accepted by MongoDB for importing):

[
    {"_id": 1, "value": "\u041c\..."}
    {"_id": 2, "value": "\u041d\..."}
    ...
]

When I try to import it into MongoDB with the following command, I get a speed of only about 50 documents per second, which is far too slow for me.

mongoimport --db wordbase --collection sentences --type json --file C:\Users\Aleksandar\PycharmProjects\NLPSeminarska\my_file.json -jsonArray

When I tried to insert the data into the collection using Python with PyMongo, the speed was even worse. I also tried increasing the priority of the process, but it made no difference.
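For reference, a batched PyMongo insert of this file would look roughly like the sketch below (an illustration, not my actual script; it assumes a local mongod, the wordbase/sentences names from the command above, an arbitrary batch size of 1000, and a PyMongo release that provides insert_many):

import json
from pymongo import MongoClient

# Illustrative sketch only: read the array line by line and insert in batches.
client = MongoClient("localhost", 27017)
collection = client.wordbase.sentences

batch = []
with open("my_file.json", encoding="utf-8") as f:
    for line in f:
        line = line.strip().rstrip(",")   # drop trailing commas between documents
        if line in ("[", "]", ""):        # skip the wrapping bracket lines
            continue
        batch.append(json.loads(line))
        if len(batch) == 1000:
            collection.insert_many(batch, ordered=False)
            batch = []
if batch:
    collection.insert_many(batch, ordered=False)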

The next thing I tried was the same import without -jsonArray. Although I got a big speed increase (~4000 documents/sec), it failed with an error saying that the BSON representation of the supplied JSON is too large.

I also tried splitting the file into 5 separate files and importing them from separate consoles into the same collection, but the speed of all of them dropped to about 20 documents/sec.

While searching the web I saw that people reached speeds of over 8K documents/sec, and I can't see what I am doing wrong.

Is there a way to speed this up? Or should I convert the whole JSON file to BSON and import it that way, and if so, what is the correct way to do both the conversion and the import?

Huge thanks.

Aleksandar
  • You do realize that your syntax is wrong in the first place, as that is only a valid option as `--jsonArray`. The next point is that there is a 16MB limit imposed on data that can be "slurped" in at once in this way, since there is a BSON limit. The bottom line here is: remove the wrapping `[]` bracket chars from your input file, then make sure every line is terminated by a newline `\n` char after the wrapping document braces `{}`. Ultimately, break up your file once processed this way and run several processes in parallel (a rough sketch of that preprocessing follows these comments). It's a 15GB file. What did you expect? Millisecond response? – Neil Lunn Jan 11 '15 at 09:11
  • Also highly off topic. Stack Overflow is for programming topics only. This is better suited to [dba.stackexchange.com](http://dba.stackexchange.com), which is where you should have posted this in the first place. – Neil Lunn Jan 11 '15 at 09:13
  • @NeilLunn, thank you very much for your response. Removing the [] brackets and not using --jsonArray made the import go at about 8500 documents per second. And thanks for pointing me to the right site, have a nice day. – Aleksandar Jan 11 '15 at 12:12
  • @buncis consider using this method https://stackoverflow.com/questions/49808581/using-jq-how-can-i-split-a-very-large-json-file-into-multiple-files-each-a-spec?rq=1 – buncis Sep 30 '19 at 12:07
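As a rough illustration of the preprocessing described in the first comment, the following Python sketch (an illustration only, not code from the original thread) drops the wrapping `[` / `]` lines and trailing commas, writes one document per line, and splits the result into several files that can be fed to parallel mongoimport processes. The chunk count and file names are arbitrary, and it assumes one document per line, as in the sample shown in the question:

# Rough sketch: convert the bracket-wrapped array into newline-delimited
# JSON and split it into several files for parallel mongoimport runs.
def split_for_import(path, parts=5, prefix="part"):
    outputs = [open("%s_%d.json" % (prefix, i), "w", encoding="utf-8")
               for i in range(parts)]
    kept = 0
    with open(path, encoding="utf-8") as src:
        for line in src:
            line = line.strip().rstrip(",")   # drop trailing commas
            if line in ("[", "]", ""):        # skip the wrapping brackets
                continue
            outputs[kept % parts].write(line + "\n")
            kept += 1
    for out in outputs:
        out.close()

split_for_import("my_file.json")

Each resulting part_N.json file can then be imported by its own mongoimport process, without -jsonArray.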

1 Answer


I had the exact same problem with a 160GB dump file. It took me two days to load 3% of the original file with -jsonArray, and 15 minutes with these changes.

First, remove the initial [ and trailing ] characters:

sed -i 's/^\[//; s/\]$//' filename.json

Then import without the -jsonArray option:

mongoimport --db "dbname" --collection "collectionname" --file filename.json

If the file is huge, sed will take a really long time and you may run into storage problems. You can use this C program instead (not written by me, all glory to @guillermobox):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    FILE * f;
    const size_t buffersize = 2048;
    size_t length, filesize, position;
    char buffer[buffersize + 1];

    if (argc < 2) {
        fprintf(stderr, "Please provide file to mongofix!\n");
        exit(EXIT_FAILURE);
    }

    f = fopen(argv[1], "r+");
    if (f == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    /* get the full filesize */
    fseek(f, 0, SEEK_END);
    filesize = ftell(f);

    /* Ignore the first character */
    fseek(f, 1, SEEK_SET);

    while (1) {
        /* read chunks of buffersize size */
        length = fread(buffer, 1, buffersize, f);
        position = ftell(f);

        /* write the same chunk, one character before */
        fseek(f, position - length - 1, SEEK_SET);
        fwrite(buffer, 1, length, f);

        /* return to the reading position */
        fseek(f, position, SEEK_SET);

        /* we have finished when not all the buffer is read */
        if (length != buffersize)
            break;
    }

    /* flush buffered writes, then truncate the file, two characters shorter */
    fflush(f);
    ftruncate(fileno(f), filesize - 2);

    fclose(f);

    return 0;
}
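To use the program, save it as, say, mongofix.c, compile it with `cc mongofix.c -o mongofix`, and run `./mongofix filename.json`. It shifts the file contents one byte to the left in place (dropping the leading `[`) and then truncates the last two bytes, which removes the trailing `]` (this assumes the closing `]` is the file's final byte). After that, the same mongoimport command as above applies.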

P.S.: I don't have the power to suggest a migration of this question but I think this could be helpful.

nessa.gp