9

For creating datasets to train with Caffe, I have tried both HDF5 and LMDB. However, creating an LMDB is very slow, even slower than HDF5. I am trying to write ~20,000 images.

Am I doing something terribly wrong? Is there something I am not aware of?

This is my code for LMDB creation:

import lmdb
import caffe

DB_KEY_FORMAT = "{:0>10d}"
in_db_data = lmdb.open(path, map_size=int(1e12))
curr_idx = 0
commit_size = 1000
for curr_commit_idx in range(0, num_data, commit_size):
    with in_db_data.begin(write=True) as in_txn:
        for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
            d, l = data[i], labels[i]
            im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
            key = DB_KEY_FORMAT.format(curr_idx)
            in_txn.put(key, im_dat.SerializeToString())
            curr_idx += 1
in_db_data.close()

As you can see, I create a transaction for every 1,000 images, because I thought creating a transaction for each image would add overhead, but it seems this doesn't influence performance very much.
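
A minimal, self-contained sketch to check whether commit granularity is actually the bottleneck; the paths, the 10 KB dummy payload and the 5,000-record count are placeholders, not the real data:

import time

import lmdb

payload = b"x" * 10240  # stand-in for a serialized Datum

def write(path, commit_size, n=5000):
    # Write n dummy records, committing every commit_size puts.
    env = lmdb.open(path, map_size=int(1e10))
    start = time.time()
    for first in range(0, n, commit_size):
        with env.begin(write=True) as txn:
            for i in range(first, min(first + commit_size, n)):
                txn.put("{:0>10d}".format(i).encode("ascii"), payload)
    env.close()
    return time.time() - start

print("per-record commits:   %.2f s" % write("/tmp/lmdb_per_record", 1))
print("1,000-record commits: %.2f s" % write("/tmp/lmdb_batched", 1000))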

Simikolon
  • why aren't you using the [`convert_imageset`](http://stackoverflow.com/a/31431716/1714410) tool? – Shai Jul 27 '15 at 09:41
  • @Shai: Actually I wasn't aware of it, but I also don't have my images as files. Though, why should it be faster? Is the Python implementation that slow? – Simikolon Jul 27 '15 at 09:53
  • I'm working with `convert_imageset` on ilsvrc12 (ImageNet), converting datasets of ~1M images; it takes a while, but it works. – Shai Jul 27 '15 at 10:30
  • where do you get your `data` from? – Shai Jul 27 '15 at 10:31
  • I have HDF5 files containing my data. I know Caffe can use HDF5 files as data source, unfortunately when doing so Caffe does not allow data transform. – Simikolon Jul 27 '15 at 11:51
  • What transformations do you require? – Shai Jul 27 '15 at 12:02
  • Actually, I want to use data augmentation like cropping and mirroring. – Simikolon Jul 27 '15 at 12:07
  • Then, you can either save your HDF5 images to JPEGs and process them through the conventional pipeline that allows for data augmentation. Or, you can manually crop and mirror, creating additional numpy arrays, saving them to HDF5 and feeding the augmented HDF5 to the net (a minimal sketch of this follows these comments). – Shai Jul 27 '15 at 12:24
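
A minimal sketch of the second option from the comments above (offline cropping and mirroring with numpy, written back to HDF5); the file names, the dataset keys `data`/`label` and the crop size are assumptions, not part of the original setup:

import h5py
import numpy as np

CROP = 227  # assumed target crop size

with h5py.File("train.h5", "r") as f:
    data, labels = f["data"][...], f["label"][...]   # data: N x C x H x W

aug_images, aug_labels = [], []
for img, lbl in zip(data, labels):
    _, h, w = img.shape
    top, left = (h - CROP) // 2, (w - CROP) // 2
    crop = img[:, top:top + CROP, left:left + CROP]   # center crop
    aug_images.extend([crop, crop[:, :, ::-1]])       # original + horizontal mirror
    aug_labels.extend([lbl, lbl])

with h5py.File("train_augmented.h5", "w") as f:
    f.create_dataset("data", data=np.asarray(aug_images, dtype=np.float32))
    f.create_dataset("label", data=np.asarray(aug_labels, dtype=np.float32))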

4 Answers

6

In my experience, writes to LMDB from Python for Caffe data took 50-100 ms each on an ext4 hard disk on Ubuntu. That's why I use tmpfs (the RAM disk functionality built into Linux) and get those writes down to around 0.07 ms. You can make smaller databases on your RAM disk, copy them to a hard disk and later train on all of them. I make them around 20-40 GB each, as I have 64 GB of RAM.

Here are some pieces of code to help you dynamically create, fill and move LMDBs to storage. Feel free to edit them to fit your case. They should save you some time getting your head around how LMDB and file manipulation work in Python.

import os
import random
import shutil
import string

import lmdb


def move_db():
    # Close the LMDB on the RAM disk, move it to permanent storage under a
    # random name, then reopen a fresh one ('fold' is a global path prefix).
    global image_db
    image_db.close()
    rnd = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(5))
    shutil.move(fold + 'ram/train_images', '/storage/lmdb/' + rnd)
    open_db()


def open_db():
    global image_db
    image_db = lmdb.open(os.path.join(fold, 'ram/train_images'),
                         map_async=True,
                         max_dbs=0)

def write_to_lmdb(db, key, value):
    """
    Write (key, value) to db, growing the map when it fills up.
    """
    success = False
    while not success:
        txn = db.begin(write=True)
        try:
            txn.put(key, value)
            txn.commit()
            success = True
        except lmdb.MapFullError:
            txn.abort()
            # Double the map_size and retry.
            curr_limit = db.info()['map_size']
            new_limit = curr_limit * 2
            print('>>> Doubling LMDB map size to %sMB ...' % (new_limit >> 20,))
            db.set_mapsize(new_limit)

...

image_datum = caffe.io.array_to_datum(transformed_image, label)
write_to_lmdb(image_db, str(itr), image_datum.SerializeToString())
Íhor Mé
  • Can you give a bit more context what `tempfs` is? – Steve Heim May 10 '16 at 21:42
  • Can you please provide specific code describing your solution/workflow? – Shai May 11 '16 at 04:29
  • This is an excellent suggestion! @SteveHeim See [this post](http://askubuntu.com/questions/152868/how-do-i-make-a-ram-disk) for details on creating a RAM disk in Ubuntu. Rather than writing data to a hard disk, which can be very slow when a large number of writes are involved, you can mount a directory to a RAM location. While the interface is the same as any other directory, read and write access to the mounted directory will be orders of magnitude faster. When you're finished using your database you can then move it to another directory on a hard disk for long term storage. – Jake May 28 '16 at 19:19
  • Steve, as I wrote, tempfs is a RAM disk fs on Linux. You can use a different RAM disk filesystem if you're on another OS, it doesn't matter. – Íhor Mé Jul 05 '16 at 17:47
  • Actually, sorry for the typo, I meant tmpfs. Shai, my specific code is pretty specific as I'm getting my data through interprocess communication via sockets - I'm distorting stuff on HTML canvas and POSTing it to a socket. But okay, I'll update my answer to include code for manipulating DBs. Read about tmpfs somewhere else, though. That one is well documented. – Íhor Mé Jul 05 '16 at 17:55
3

Try this:

DB_KEY_FORMAT = "{:0>10d}"
in_db_data = lmdb.open(path, map_size=int(1e12))
curr_idx = 0
commit_size = 1000
with in_db_data.begin(write=True) as in_txn:
    for curr_commit_idx in range(0, num_data, commit_size):
        for i in range(curr_commit_idx, min(curr_commit_idx + commit_size, num_data)):
            d, l = data[i], labels[i]
            im_dat = caffe.io.array_to_datum(d.astype(float), label=int(l))
            key = DB_KEY_FORMAT.format(curr_idx)
            in_txn.put(key, im_dat.SerializeToString())
            curr_idx += 1
in_db_data.close()

The line

with in_db_data.begin(write=True) as in_txn:

takes a lot of time, so open the write transaction once and keep it for the whole run instead of opening a new one for every batch.

Skyduy
1

LMDB writes are very sensitive to key order: if you sort the data by key before insertion, write speed improves significantly.
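
As a rough illustration of this point (not code from the answer): with keys pre-sorted, py-lmdb's `Cursor.putmulti` with `append=True` inserts sequentially without a per-key page search. `serialized_data` below is an assumed placeholder for your already-serialized values:

import lmdb

# serialized_data is assumed: an iterable of serialized values (e.g. Datum strings).
items = [("{:0>10d}".format(i).encode("ascii"), value)
         for i, value in enumerate(serialized_data)]
items.sort(key=lambda kv: kv[0])  # ensure keys are in ascending order

env = lmdb.open("train_lmdb", map_size=int(1e12))
with env.begin(write=True) as txn:
    # append=True requires sorted keys and skips the usual lookup per put.
    txn.cursor().putmulti(items, append=True)
env.close()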

Ophir Yoktan
1

I did a small benchmark to illustrate Ophir's point:

Machine:

Raspberry Pi 4B, overclocked to 1.75 GHz, 4 GB RAM, Raspberry Pi OS, OS on SSD

Code:

import timeit

import lmdb


def insert_lmdb(fsobj, transaction):
    # generate_hash_from_file is the author's helper: it hashes a file's contents.
    transaction.put(key=str(fsobj).encode("utf-8", "ignore"),
                    value=generate_hash_from_file(fsobj).hexdigest().encode("utf-8", "ignore"))


print(f"\n> Insert results in lmdb <")

# Directory is the author's helper class for collecting file paths.
list_f = Directory(path=DIR_ECTORY, use_hash=False, hash_from_content=False).lists["files"]

# list_f = sorted(list_f)  # Run only in the 'sorted' case.

st = timeit.default_timer()

env = lmdb.open(path=DB_NAME)

with env.begin(write=True) as txn:
    for i in list_f:
        insert_lmdb(i, transaction=txn)

# TIMES and records come from the author's benchmark harness (the run is repeated TIMES times).
average = (timeit.default_timer() - st) * 1000000 / records

print(f"Test repeated {TIMES} times.\nNumber of files: {records}\nAverage time: {round(average, 3)} us or {round(1000000/average/1000, 3)}k inserts/sec")

Results:

Without sorting:

> Insert results in lmdb <
Test repeated 50000 times.
Number of files: 363
Average time: 84 us or 12k inserts/sec

With sorting:

> Insert results in lmdb <
Test repeated 50000 times.
Number of files: 363
Average time: 18.5 us or 54k inserts/sec

Sorting brought a 4.5 times speed increase in writes, not bad for only one extra line of code :).