2

I have learned a little about the Caffe framework (which is used to define and train deep learning models).

As my first project, I wanted to write a program that trains and tests a "Face Emotion Recognition" model using the fer2013 dataset.

The dataset I downloaded is in CSV format. As far as I know, to work with Caffe the dataset has to be in either "lmdb" or "hdf5" format.

So it seems that the first thing I have to do is convert my dataset into hdf5 or lmdb format.
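For reference, each row of the fer2013 CSV holds an emotion label (0-6), a pixels column of 2304 space-separated grayscale values (one 48x48 image), and a Usage column. A rough sketch of how the parsing might look (the function name is just illustrative, and I have not verified this end to end):

```python
import io

import numpy as np
import pandas as pd

def load_fer2013(path_or_buf):
    # Parse the fer2013 CSV: columns are emotion, pixels, Usage;
    # 'pixels' is a string of 2304 space-separated grayscale values (48x48).
    df = pd.read_csv(path_or_buf)
    images = np.stack([
        np.array(p.split(), dtype=np.uint8).reshape(48, 48)
        for p in df['pixels']
    ])
    labels = df['emotion'].to_numpy()
    return images, labels
```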

Here is a simple code I tried at first:

import pandas as pd
import numpy as np
import csv

csvFile = pd.HDFStore('PrivateTest.csv')
PrivateTestHDF5 = csvFile.to_hdf(csvFile)

print len(PrivateTestHDF5)

But it doesn't work, and I get this error:

" Unable to open/create file 'PrivateTest.csv "

I have searched a lot and found this link, but I still cannot understand how it reads from a CSV file.

Also, I do not have Matlab installed.

I would be happy if anyone could help me with this. I would also appreciate any advice about writing Caffe models for datasets from the Kaggle website or any other source (those that are not on the Caffe website).

kadaj13
  • You should be specific in your question, read http://stackoverflow.com/help/how-to-ask. In particular this has nothing to do with Caffe or matlab (although those might be components of your overarching problem, they're not directly relevant to the issue). I would take a look at the docstrings for `pd.HDFStore` – mgilbert Aug 07 '16 at 18:36
  • @mgilbert I didn't know that talking about Caffe here is not useful. Do you think it is better to edit my question (removing the caffe tag)? – kadaj13 Aug 07 '16 at 20:22
  • Yes I would restrict your question to one specific issue, e.g. reading from a csv or writing to an hdf5 file. I would take a look at http://pandas.pydata.org/pandas-docs/version/0.18.1/tutorials.html which gives a good overview of both – mgilbert Aug 07 '16 at 20:56
  • Could you please be more specific? – Enamul Hassan Aug 08 '16 at 00:25

2 Answers

3

Your input data doesn't have to be in lmdb or hdf5. You can input data from a csv file. All you have to do is use an ImageData input layer, such as this one:

layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: false
    crop_size: 224
    mean_file: "./supporting_files/mean.binaryproto"
  }
  image_data_param {
    source: "./supporting_files/labels_train.txt"
    batch_size: 64
    shuffle: true
    new_height: 339
    new_width: 339
  }
}

Here, the file "./supporting_files/labels_train.txt" is just a plain text file in which each line holds the path to an input image stored on the file system as a regular image, followed by its label.
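If your data is not already on disk as image files, you still need to generate that listing. A minimal sketch of writing it (the column names and helper are illustrative, assuming you have already saved each row of the CSV as an image file):

```python
import os
import tempfile

import pandas as pd

def write_image_listing(df, out_path):
    # Write one "image_path label" line per row -- the format the
    # ImageData layer's source file expects.
    with open(out_path, "w") as f:
        for path, label in zip(df["path"], df["emotion"]):
            f.write("%s %d\n" % (path, int(label)))
```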

This is usually the simplest way to provide data to the model. But if you really have to use an HDF5 file, you can use something like this function:

import h5py
import numpy as np
import caffe

def create_h5_file(labels, file_name, width, height, nr_labels_per_image):
    # labels is a list of tuples: ("/path/to/image", ["label1", ..., "labeln"])
    nr_entries = len(labels)
    images = np.zeros((nr_entries, 3, height, width), dtype='f4')
    image_labels = np.zeros((nr_entries, nr_labels_per_image), dtype='f4')

    for i, l in enumerate(labels):
        # load_image returns a height x width x 3 float array in [0, 1]
        img = caffe.io.load_image(l[0])

        # pre-process and/or augment your data here

        # transpose to Caffe's channel-first (3 x height x width) layout
        images[i] = img.transpose((2, 0, 1))
        image_labels[i] = [int(x) for x in l[1]]

    with h5py.File(file_name, "w") as H:
        H.create_dataset("data", data=images)
        H.create_dataset("label", data=image_labels)

where file_name is a string with the path of the HDF5 output file, and labels is a list of tuples such as ("/path/to/my/image", ["label1", "label2", ..., "labeln"]).

Notice that this function works for datasets with multiple labels per image (one valid reason for using hdf5 instead of a csv file), but you probably only need a single label per image.
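One detail to add: Caffe's HDF5Data layer does not take the .h5 file directly; its source parameter points to a plain text file listing one HDF5 file path per line. A hedged sketch of writing both files (the paths and helper name are illustrative):

```python
import os
import tempfile

import h5py
import numpy as np

def write_h5_with_list(images, labels, h5_path, list_path):
    # Write the "data"/"label" datasets Caffe expects, then the text
    # file that the HDF5Data layer's "source" parameter points to.
    with h5py.File(h5_path, "w") as f:
        f.create_dataset("data", data=images.astype("f4"))
        f.create_dataset("label", data=labels.astype("f4"))
    with open(list_path, "w") as f:
        f.write(h5_path + "\n")
```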

Mppl
  • Thank you very much. I encountered another problem. The file "create_imagenet_mean.sh" has this line: $TOOLS/compute_image_mean $EXAMPLE/ilsvrc12_train_lmdb \ so it seems I need to have an lmdb version of my data to use this script. Am I right? – kadaj13 Aug 10 '16 at 19:11
  • And also another question: does this code work on that dataset? In that dataset the labels and images are in the same file, and I do not know how to change this code for it. I would appreciate it if you could help me with that. Thanks :) @Mppl – kadaj13 Aug 11 '16 at 04:23
-1

A bit late, but I wanted to point out that if the CSV file is too big to load into memory, you can use pandas' chunksize to split the file and load the chunks one by one into HDF5:

import pandas as pd

csvfile = 'yourCSVfile.csv'
hdf5File = 'yourh5File.h5'

tp = pd.read_csv(csvfile, chunksize=100000)

for chunk in tp:
    chunk.to_hdf(hdf5File, key='data', mode='a', format='table', append=True)

Note that append=True requires the table format.
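To check that the chunked conversion kept every row, you can read the table back and compare counts (a sketch with an illustrative helper name; reading back this way requires PyTables):

```python
import os
import tempfile

import pandas as pd

def rows_match(csv_path, h5_path, chunksize=100000):
    # Count CSV rows chunk by chunk and compare against the
    # number of rows in the appended HDF5 table.
    n_csv = sum(len(chunk) for chunk in pd.read_csv(csv_path, chunksize=chunksize))
    n_h5 = len(pd.read_hdf(h5_path, key='data'))
    return n_csv == n_h5
```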

Nadav