I have a file of about 100 GB with one `word:tag` pair per line. I want to index it on `word` so that I can easily get the list of `tag`s for a given word.

I wanted to store this in boltdb (mainly to check out boltdb), but random write access is bad, so I was aiming to index the file some other way first and then move all of it into boltdb, without needing to check for duplicates or de/serialise the `tag` list on every write.
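To make this concrete, the input looks like the following made-up lines (real words and tags will differ):

run:VB
run:NN
blue:JJ

and a lookup for `run` should return something like `[VB, NN]`.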
So, for reference, if I simply read the file into memory (discarding data), I get about 8 MB/s.
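That baseline is just a plain line scan that discards the data, roughly like the sketch below (assuming os + bufio and the same `logger.FatalErr` helper as in the snippets that follow; `path` is a placeholder for the input file):

// Baseline: scan the 100 GB file line by line and throw the data away.
f, err := os.Open(path)
logger.FatalErr(err)
defer f.Close()

scanner := bufio.NewScanner(f)
bytesRead := 0
for scanner.Scan() {
    bytesRead += len(scanner.Bytes()) + 1 // +1 for the newline that Scan strips
}
logger.FatalErr(scanner.Err())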
If I write to boltdb using code such as this:
line := ""
linesRead := 0
for scanner.Scan() {
line = scanner.Text()
linesRead += 1
data := strings.Split(line, ":")
err = bucket.Put([]byte(data[0]), []byte(data[1]))
logger.FatalErr(err)
// commit on every N lines
if linesRead % 10000 == 0 {
err = tx.Commit()
logger.FatalErr(err)
tx, err = db.Begin(true)
logger.FatalErr(err)
bucket = tx.Bucket(name)
}
}
I get about 300 KB/s, and this is not even complete (it does not append each `tag` to its `word`, it only stores the last occurrence). Adding the tag list and its JSON serialisation would definitely lower that speed further...
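For reference, the "complete" version would have to do a read-modify-write with JSON on every line, roughly like this sketch (same setup as above, using encoding/json; this per-line overhead is what I am trying to avoid by pre-indexing):

data := strings.Split(line, ":")
key, tag := []byte(data[0]), data[1]

// read the existing tag list for this word, if any
var tags []string
if existing := bucket.Get(key); existing != nil {
    err = json.Unmarshal(existing, &tags)
    logger.FatalErr(err)
}

// append the new tag and write the JSON-encoded list back
tags = append(tags, tag)
encoded, err := json.Marshal(tags)
logger.FatalErr(err)
err = bucket.Put(key, encoded)
logger.FatalErr(err)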
So I gave MongoDB a try:
// c is the mgo collection; make sure word is indexed up front.
index := mgo.Index{
    Key:        []string{"word"},
    Unique:     true,
    DropDups:   false,
    Background: true,
    Sparse:     true,
}
err = c.EnsureIndex(index)
logger.FatalErr(err)

line := ""
linesRead := 0
bulk := c.Bulk()
for scanner.Scan() {
    line = scanner.Text()
    data := strings.Split(line, ":")
    // upsert the word and append the tag to its tags array
    bulk.Upsert(bson.M{"word": data[0]}, bson.M{"$push": bson.M{"tags": data[1]}})
    linesRead++
    // run the bulk every N lines
    if linesRead%10000 == 0 {
        _, err = bulk.Run()
        logger.FatalErr(err)
        bulk = c.Bulk()
    }
}
// flush the final partial batch
_, err = bulk.Run()
logger.FatalErr(err)
And I get about 300 KB/s as well (though `Upsert` and `$push` here do handle appending to the list).
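The resulting documents at least have the shape I want; reading one back with mgo looks roughly like this sketch (the word "run" and the variable names are just placeholders):

// look up the tag list for a single word
var doc struct {
    Word string   `bson:"word"`
    Tags []string `bson:"tags"`
}
err = c.Find(bson.M{"word": "run"}).One(&doc)
logger.FatalErr(err)
fmt.Println(doc.Tags) // e.g. [VB NN]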
I tried a local MySQL instance as well (indexed on `word`), but it was about 30x slower...