
I have a train folder. In this folder there are 2000 images of different sizes, and I also have a labels.csv file. When training the network, loading and resizing these images is time consuming, so I read some posts about h5py, which is a solution for this situation. I tried the following code:

import os
import datetime as dt
from glob import glob

import cv2
import h5py
import pandas as pd

PATH = os.path.abspath(os.path.join('Data'))
SOURCE_IMAGES = os.path.join(PATH, "Train")
print "[INFO] reading image paths"
images = glob(os.path.join(SOURCE_IMAGES, "*.jpg"))
images.sort()
print "[INFO] reading image labels"
labels = pd.read_csv('Data/labels.csv')

# Binarize the "car" column into 1.0 / 0.0 labels
train_labels = []
for i in range(len(labels["car"])):
    if labels["car"][i] == 1.0:
        train_labels.append(1.0)
    else:
        train_labels.append(0.0)

data_order = 'tf'  # 'th' = channels first (Theano), 'tf' = channels last (TensorFlow)

if data_order == 'th':
    train_shape = (len(images), 3, 224, 224)
else:
    train_shape = (len(images), 224, 224, 3)

print "[INFO] creating h5py file"
hf = h5py.File('data.hdf5', 'w')

hf.create_dataset("train_img",
                  shape=train_shape,
                  maxshape=train_shape,
                  compression="gzip",
                  compression_opts=9)

hf.create_dataset("train_labels",
                  shape=(len(train_labels),),
                  maxshape=(None,),
                  compression="gzip",
                  compression_opts=9)

hf["train_labels"][...] = train_labels

print "[INFO] reading and resizing images"
for i, addr in enumerate(images):
    s = dt.datetime.now()
    img = cv2.imread(addr)
    img = cv2.resize(img, (224, 224), interpolation=cv2.INTER_CUBIC)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    hf["train_img"][i, ...] = img[None]
    e = dt.datetime.now()
    print "[INFO] image", str(i), "saved, time:", e - s

hf.close()

But when I run this code, it runs for hours. At first it is very fast, but later the writes become very slow, especially at the line hf["train_img"][i, ...] = img[None]. Here is the output of the program. As you can see, the time per image is constantly increasing. Where am I going wrong? Thanks for any advice.

[screenshot of the program's output: per-image save times, steadily increasing]

hrzm
  • Your time profiling includes the time for reading and converting the image. If the image size varies, this is likely the time-consuming part. Maybe the later images are just larger, thus taking longer to read and convert? – w-m Jul 27 '18 at 09:01
  • No, I'm sure this is not about image size. I tried and found that the hf["train_img"][i, ...] = img[None] line is where it waits. – hrzm Jul 27 '18 at 09:32
  • This is a typical behaviour if you use the wrong chunk_shape (choose for example (1,3,224,224)) and insufficient chunk_cache_size. Take a look at https://stackoverflow.com/a/48405220/4045774 https://stackoverflow.com/a/44961222/4045774 – max9111 Aug 01 '18 at 09:00
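
For reference, a minimal sketch of what max9111's comment suggests, assuming the channels-last layout from the question (the (1,3,224,224) example in the comment is for channels-first data) and an h5py version (2.9+) that exposes the raw chunk cache size:

import h5py

# Larger raw chunk cache (64 MiB here; the HDF5 default is 1 MiB), so
# recently used compressed chunks stay in memory between writes.
hf = h5py.File('data.hdf5', 'w', rdcc_nbytes=64 * 1024**2)

# One image per chunk: writing hf["train_img"][i, ...] then touches
# exactly one chunk instead of whatever shape auto-chunking picked.
hf.create_dataset("train_img",
                  shape=(2000, 224, 224, 3),
                  chunks=(1, 224, 224, 3),
                  compression="gzip",
                  compression_opts=4)

hf.close()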

1 Answer


train_img is created with compression_opts=9. This is the highest gzip compression level, and it takes the most work to compress/decompress.

If compressing the images is the bottleneck and you can trade some disk space for speed, use a lower compression level, like the default (=4), or even disable compression completely.
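
For example, a minimal sketch of the same dataset created with the default level and with compression disabled (the shape is the channels-last one from the question; the file name and the second dataset name are just for illustration):

import h5py

hf = h5py.File('data_fast.hdf5', 'w')

# Default gzip level: omitting compression_opts gives level 4.
# dtype='uint8' matches 8-bit image data (h5py otherwise defaults
# to float32, quadrupling the stored size).
hf.create_dataset("train_img",
                  shape=(2000, 224, 224, 3),
                  dtype='uint8',
                  compression="gzip")

# Or no compression at all: fastest writes, largest file
# (2000 * 224 * 224 * 3 bytes, about 300 MB as uint8).
hf.create_dataset("train_img_raw",
                  shape=(2000, 224, 224, 3),
                  dtype='uint8')

hf.close()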

w-m