I have a file of about 100 GB with one `word:tag` pair per line. I want to index it on `word` so that I can easily get the list of `tag`s for a given word.

I wanted to store this in boltdb (mainly to check out boltdb), but random write access is bad, so I was aiming to index the file some other way first and then move all of it into boltdb, without needing to check for duplicates or de/serialise the `tag` list on every write.
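To make this concrete, the input looks like the following made-up lines (real words and tags will differ):

run:VB
run:NN
blue:JJ

and a lookup for `run` should return something like `[VB, NN]`.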
So, for reference, if I simply read the file into memory (discarding data), I get about 8 MB/s.
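That baseline is just a plain line scan that discards the data, roughly like the sketch below (assuming os + bufio and the same `logger.FatalErr` helper as in the snippets that follow; `path` is a placeholder for the input file):

// Baseline: scan the 100 GB file line by line and throw the data away.
f, err := os.Open(path)
logger.FatalErr(err)
defer f.Close()

scanner := bufio.NewScanner(f)
bytesRead := 0
for scanner.Scan() {
    bytesRead += len(scanner.Bytes()) + 1 // +1 for the newline that Scan strips
}
logger.FatalErr(scanner.Err())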
If I write to boltdb using code such as this:
line := ""
linesRead := 0
for scanner.Scan() {
line = scanner.Text()
linesRead += 1
data := strings.Split(line, ":")
err = bucket.Put([]byte(data[0]), []byte(data[1]))
logger.FatalErr(err)
// commit on every N lines
if linesRead % 10000 == 0 {
err = tx.Commit()
logger.FatalErr(err)
tx, err = db.Begin(true)
logger.FatalErr(err)
bucket = tx.Bucket(name)
}
}
I get about 300 KB/s, and this is not even complete (it does not append each `tag` to its `word`, it only stores the last occurrence). Adding the tag list and its JSON serialisation would definitely lower that speed further...
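For reference, the "complete" version would have to do a read-modify-write with JSON on every line, roughly like this sketch (same setup as above, using encoding/json; this per-line overhead is what I am trying to avoid by pre-indexing):

data := strings.Split(line, ":")
key, tag := []byte(data[0]), data[1]

// read the existing tag list for this word, if any
var tags []string
if existing := bucket.Get(key); existing != nil {
    err = json.Unmarshal(existing, &tags)
    logger.FatalErr(err)
}

// append the new tag and write the JSON-encoded list back
tags = append(tags, tag)
encoded, err := json.Marshal(tags)
logger.FatalErr(err)
err = bucket.Put(key, encoded)
logger.FatalErr(err)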
So I gave MongoDB a try:
// c is the mgo collection; make sure word is indexed up front.
index := mgo.Index{
    Key:        []string{"word"},
    Unique:     true,
    DropDups:   false,
    Background: true,
    Sparse:     true,
}
err = c.EnsureIndex(index)
logger.FatalErr(err)

line := ""
linesRead := 0
bulk := c.Bulk()
for scanner.Scan() {
    line = scanner.Text()
    data := strings.Split(line, ":")
    // upsert the word and append the tag to its tags array
    bulk.Upsert(bson.M{"word": data[0]}, bson.M{"$push": bson.M{"tags": data[1]}})
    linesRead++
    // run the bulk every N lines
    if linesRead%10000 == 0 {
        _, err = bulk.Run()
        logger.FatalErr(err)
        bulk = c.Bulk()
    }
}
// flush the final partial batch
_, err = bulk.Run()
logger.FatalErr(err)
And I get about 300 KB/s as well (though `Upsert` and `$push` here do handle appending to the list).
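The resulting documents at least have the shape I want; reading one back with mgo looks roughly like this sketch (the word "run" and the variable names are just placeholders):

// look up the tag list for a single word
var doc struct {
    Word string   `bson:"word"`
    Tags []string `bson:"tags"`
}
err = c.Find(bson.M{"word": "run"}).One(&doc)
logger.FatalErr(err)
fmt.Println(doc.Tags) // e.g. [VB NN]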
I tried a local MySQL instance as well (indexed on `word`), but it was about 30x slower...