9

For creating datasets to train with Caffe, I have tried both HDF5 and LMDB. However, creating an LMDB is very slow, even slower than HDF5. I am trying to write ~20,000 images.

Am I doing something terribly wrong? Is there something I am not aware of?

This is my code for LMDB creation:

import lmdb
import caffe

DB_KEY_FORMAT = "{:0>10d}"
in_db_data = lmdb.open(path, map_size=int(1e12))
curr_idx = 0
commit_size = 1000
for curr_commit_idx in range(0, num_data, commit_size):
    with in_db_data.begin(write=True) as in_txn:
        for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
            d, l = data[i], labels[i]
            im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
            key = DB_KEY_FORMAT.format(curr_idx)
            in_txn.put(key, im_dat.SerializeToString())
            curr_idx += 1
in_db_data.close()

As you can see, I create a transaction for every 1,000 images, because I thought creating a transaction for each image would add overhead, but it seems this doesn't influence performance very much.
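
A minimal, self-contained sketch to check whether commit granularity is actually the bottleneck; the paths, the 10 KB dummy payload and the 5,000-record count are placeholders, not the real data:

import time

import lmdb

payload = b"x" * 10240  # stand-in for a serialized Datum

def write(path, commit_size, n=5000):
    # Write n dummy records, committing every commit_size puts.
    env = lmdb.open(path, map_size=int(1e10))
    start = time.time()
    for first in range(0, n, commit_size):
        with env.begin(write=True) as txn:
            for i in range(first, min(first + commit_size, n)):
                txn.put("{:0>10d}".format(i).encode("ascii"), payload)
    env.close()
    return time.time() - start

print("per-record commits:   %.2f s" % write("/tmp/lmdb_per_record", 1))
print("1,000-record commits: %.2f s" % write("/tmp/lmdb_batched", 1000))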

Simikolon
  • why aren't you using the [`convert_imageset`](http://stackoverflow.com/a/31431716/1714410) tool? – Shai Jul 27 '15 at 09:41
  • @Shai: Actually I wasn't aware of it, but I also don't have my images as files. Though, why should it be faster? Is the Python implementation that slow? – Simikolon Jul 27 '15 at 09:53
  • I'm working with `convert_imageset` on ilsvrc12 (ImageNet), converting datasets of ~1M images; it takes a while, but it works. – Shai Jul 27 '15 at 10:30
  • where do you get your `data` from? – Shai Jul 27 '15 at 10:31
  • I have HDF5 files containing my data. I know Caffe can use HDF5 files as data source, unfortunately when doing so Caffe does not allow data transform. – Simikolon Jul 27 '15 at 11:51
  • What transformations do you require? – Shai Jul 27 '15 at 12:02
  • Actually, I want to use data augmentation like cropping and mirroring. – Simikolon Jul 27 '15 at 12:07
  • Then, you can either save your HDF5 images to JPEGs and process them through the conventional pipeline that allows for data augmentation. Or, you can manually crop and mirror, creating additional numpy arrays, saving them to HDF5 and feeding the augmented HDF5 to the net (a minimal sketch of this follows these comments). – Shai Jul 27 '15 at 12:24
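
A minimal sketch of the second option from the comments above (offline cropping and mirroring with numpy, written back to HDF5); the file names, the dataset keys `data`/`label` and the crop size are assumptions, not part of the original setup:

import h5py
import numpy as np

CROP = 227  # assumed target crop size

with h5py.File("train.h5", "r") as f:
    data, labels = f["data"][...], f["label"][...]   # data: N x C x H x W

aug_images, aug_labels = [], []
for img, lbl in zip(data, labels):
    _, h, w = img.shape
    top, left = (h - CROP) // 2, (w - CROP) // 2
    crop = img[:, top:top + CROP, left:left + CROP]   # center crop
    aug_images.extend([crop, crop[:, :, ::-1]])       # original + horizontal mirror
    aug_labels.extend([lbl, lbl])

with h5py.File("train_augmented.h5", "w") as f:
    f.create_dataset("data", data=np.asarray(aug_images, dtype=np.float32))
    f.create_dataset("label", data=np.asarray(aug_labels, dtype=np.float32))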

4 Answers

6

In my experience, writes to LMDB from Python for Caffe data took 50-100 ms each on an ext4 hard disk on Ubuntu. That's why I use tmpfs (the RAM disk functionality built into Linux) and get those writes down to around 0.07 ms. You can make smaller databases on your RAM disk, copy them to a hard disk and later train on all of them. I make them around 20-40 GB each, as I have 64 GB of RAM.

Here are some pieces of code to help you dynamically create, fill and move LMDBs to storage. Feel free to edit them to fit your case. They should save you some time getting your head around how LMDB and file manipulation work in Python.

import os
import random
import shutil
import string

import lmdb


def move_db():
    # Close the LMDB on the RAM disk, move it to permanent storage under a
    # random name, then reopen a fresh one ('fold' is a global path prefix).
    global image_db
    image_db.close()
    rnd = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5))
    shutil.move(fold + 'ram/train_images', '/storage/lmdb/' + rnd)
    open_db()


def open_db():
    global image_db
    image_db = lmdb.open(os.path.join(fold, 'ram/train_images'),
                         map_async=True,
                         max_dbs=0)

def write_to_lmdb(db, key, value):
    """
    Write (key, value) to db, growing the map when it fills up.
    """
    success = False
    while not success:
        txn = db.begin(write=True)
        try:
            txn.put(key, value)
            txn.commit()
            success = True
        except lmdb.MapFullError:
            txn.abort()
            # Double the map_size and retry.
            curr_limit = db.info()['map_size']
            new_limit = curr_limit * 2
            print('>>> Doubling LMDB map size to %sMB ...' % (new_limit >> 20,))
            db.set_mapsize(new_limit)

...

image_datum = caffe.io.array_to_datum(transformed_image, label)
write_to_lmdb(image_db, str(itr), image_datum.SerializeToString())
Íhor Mé
  • Can you give a bit more context what `tempfs` is? – Steve Heim May 10 '16 at 21:42
  • Can you please provide specific code describing your solution/workflow? – Shai May 11 '16 at 04:29
  • This is an excellent suggestion! @SteveHeim See [this post](http://askubuntu.com/questions/152868/how-do-i-make-a-ram-disk) for details on creating a RAM disk in Ubuntu. Rather than writing data to a hard disk, which can be very slow when a large number of writes are involved, you can mount a directory to a RAM location. While the interface is the same as any other directory, read and write access to the mounted directory will be orders of magnitude faster. When you're finished using your database you can then move it to another directory on a hard disk for long term storage. – Jake May 28 '16 at 19:19
  • Steve, as I wrote, tempfs is a RAM disk fs on Linux. You can use a different RAM disk filesystem if you're on another OS, it doesn't matter. – Íhor Mé Jul 05 '16 at 17:47
  • Actually, sorry for the typo, I meant tmpfs. Shai, my specific code is pretty specific as I'm getting my data through interprocess communication via sockets - I'm distorting stuff on HTML canvas and POSTing it to a socket. But okay, I'll update my answer to include code for manipulating DBs. Read about tmpfs somewhere else, though. That one is well documented. – Íhor Mé Jul 05 '16 at 17:55
3

Try this:

DB_KEY_FORMAT = "{:0>10d}"
in_db_data = lmdb.open(path, map_size=int(1e12))
curr_idx = 0
commit_size = 1000
with in_db_data.begin(write=True) as in_txn:
    for curr_commit_idx in range(0, num_data, commit_size):
        for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
            d, l = data[i], labels[i]
            im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
            key = DB_KEY_FORMAT.format(curr_idx)
            in_txn.put(key, im_dat.SerializeToString())
            curr_idx += 1
in_db_data.close()

The line

with in_db_data.begin(write=True) as in_txn:

takes a lot of time, so open the write transaction once and keep it for the whole run instead of opening a new one for every batch.

Skyduy
1

LMDB writes are very sensitive to key order: if you sort the data by key before insertion, write speed improves significantly.
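
As a rough illustration of this point (not code from the answer): with keys pre-sorted, py-lmdb's `Cursor.putmulti` with `append=True` inserts sequentially without a per-key page search. `serialized_data` below is an assumed placeholder for your already-serialized values:

import lmdb

# serialized_data is assumed: an iterable of serialized values (e.g. Datum strings).
items = [("{:0>10d}".format(i).encode("ascii"), value)
         for i, value in enumerate(serialized_data)]
items.sort(key=lambda kv: kv[0])  # ensure keys are in ascending order

env = lmdb.open("train_lmdb", map_size=int(1e12))
with env.begin(write=True) as txn:
    # append=True requires sorted keys and skips the usual lookup per put.
    txn.cursor().putmulti(items, append=True)
env.close()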

Ophir Yoktan
1

I did a small benchmark to illustrate Ophir's point:

Machine:

Raspberry Pi 4B, overclocked to 1.75 GHz, 4 GB RAM, Raspberry Pi OS, OS on SSD

Code:

import timeit

import lmdb


def insert_lmdb(fsobj, transaction):
    # generate_hash_from_file is the author's helper: it hashes a file's contents.
    transaction.put(key=str(fsobj).encode("utf-8", "ignore"),
                    value=generate_hash_from_file(fsobj).hexdigest().encode("utf-8", "ignore"))


print(f"\n> Insert results in lmdb <")

# Directory is the author's helper class for collecting file paths.
list_f = Directory(path=DIR_ECTORY, use_hash=False, hash_from_content=False).lists["files"]

# list_f = sorted(list_f)  # Run only in the 'sorted' case.

st = timeit.default_timer()

env = lmdb.open(path=DB_NAME)

with env.begin(write=True) as txn:
    for i in list_f:
        insert_lmdb(i, transaction=txn)

# TIMES and records come from the author's benchmark harness (the run is repeated TIMES times).
average = (timeit.default_timer() - st) * 1000000 / records

print(f"Test repeated {TIMES} times.\nNumber of files: {records}\nAverage time: {round(average, 3)} us or {round(1000000/average/1000, 3)}k inserts/sec")

Results:

Without sorting:

> Insert results in lmdb <
Test repeated 50000 times.
Number of files: 363
Average time: 84 us or 12k inserts/sec

With sorting:

> Insert results in lmdb <
Test repeated 50000 times.
Number of files: 363
Average time: 18.5 us or 54k inserts/sec

Sorting brought a 4.5 times speed increase in writes, not bad for only one extra line of code :).