
I am trying to train a text recognition network, and I started by feeding it the MJSynth dataset. However, some images in the dataset appear to be blank. When I feed the data directly to the network, reading one of these images raises an error and training stops. Does anyone have a list of the blank images in the MJSynth dataset, so that I can remove them before training?

jd95

1 Answer


After trying many things, I ended up running a fairly long experiment that read almost 9 million images of the MJSynth dataset and collected the ones that are corrupted or blank. I found 12 corrupted images that stop training when the MJSynth data is fed directly to the model without any verification. Here is the code and the invalid images it found, so you can remove these images from the MJSynth dataset before starting training.

import os
import cv2

# Root of the extracted MJSynth dataset; adjust to your local path.
rootdir = './mjsynth/mnt/ramdisk/max/90kDICT32px'

invalid_images = []
for subdir, _, files in os.walk(rootdir):
    for file in files:
        im_path = os.path.join(subdir, file)
        # cv2.imread returns None when a file cannot be read or decoded.
        im = cv2.imread(im_path)
        if im is None:
            invalid_images.append(im_path)

print('invalid_images = {}'.format(invalid_images))

# output
invalid_images = 
['./mjsynth/mnt/ramdisk/max/90kDICT32px\\1863/4/223_Diligently_21672.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\913/4/231_randoms_62372.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2025/2/364_SNORTERS_72304.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\495/6/81_MIDYEAR_48332.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\869/4/234_TRIASSIC_80582.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\173/2/358_BURROWING_10395.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2013/2/370_refract_63890.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\368/4/232_friar_30876.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\1881/4/225_Marbling_46673.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\1817/2/363_actuating_904.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\275/6/96_hackle_34465.jpg',
'./mjsynth/mnt/ramdisk/max/90kDICT32px\\2069/4/192_whittier_86389.jpg']
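
If you would rather skip these files than delete them, a minimal sketch like the following may help. It assumes your file list contains paths that include the `90kDICT32px` directory name; `BAD_FILES`, `relative_key` and `keep_valid` are hypothetical names made up for this example, not part of any library. The paths are compared relative to the dataset root with forward slashes, so the mixed separators in the output above do not matter.

```python
# Relative paths of the 12 unreadable files, copied from the output above.
BAD_FILES = {
    "1863/4/223_Diligently_21672.jpg",
    "913/4/231_randoms_62372.jpg",
    "2025/2/364_SNORTERS_72304.jpg",
    "495/6/81_MIDYEAR_48332.jpg",
    "869/4/234_TRIASSIC_80582.jpg",
    "173/2/358_BURROWING_10395.jpg",
    "2013/2/370_refract_63890.jpg",
    "368/4/232_friar_30876.jpg",
    "1881/4/225_Marbling_46673.jpg",
    "1817/2/363_actuating_904.jpg",
    "275/6/96_hackle_34465.jpg",
    "2069/4/192_whittier_86389.jpg",
}


def relative_key(path, marker="90kDICT32px"):
    """Return the part of the path after the dataset root, using forward slashes."""
    normalized = path.replace("\\", "/")
    _, _, tail = normalized.partition(marker + "/")
    # Fall back to the whole normalized path if the marker is not present.
    return tail or normalized


def keep_valid(paths):
    """Drop every path that points at one of the known unreadable files."""
    return [p for p in paths if relative_key(p) not in BAD_FILES]
```

Calling `keep_valid(all_paths)` on your own list of image paths before building the training set should then leave out the 12 files listed above.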
jd95
  • Hi, I am also trying to use the MJSynth dataset to train the CRAFT keras-ocr implementation. I have downloaded the dataset, but it does not contain ground-truth files. How did you generate the ground-truth or label map files for MJSynth? Can you please advise? – A.R Feb 14 '22 at 12:45
  • Hi @A.R, you can generate the ground truth from the file name of the image. For example, the image './mjsynth/mnt/ramdisk/max/90kDICT32px\\1863/4/223_Diligently_21672.jpg' contains the text "Diligently", so you just have to split the filename: – jd95 May 31 '22 at 06:29
  • filename = os.path.basename('./mjsynth/mnt/ramdisk/max/90kDICT32px\\1863/4/223_Diligently_21672.jpg') – jd95 May 31 '22 at 06:31
  • gt = filename.split('_')[1] – jd95 May 31 '22 at 06:31
  • Thanks, jd95. Actually, I needed to train the Keras CRAFT text detector model, which needs character-level GTs. I have generated character-level GTs for the training. – A.R May 31 '22 at 13:23
  • @A.R, could you please tell me how you generated character-level GTs from word-level labels? – jd95 Aug 24 '22 at 07:17
  • We created the character-level GTs with one of our own tools that we use for this purpose. – A.R Sep 01 '22 at 11:20
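
The filename-splitting idea from the comments above can be put together as a small helper. This is only a sketch: `label_from_path` is a hypothetical name, and it assumes every file follows the `<number>_<word>_<number>.jpg` pattern seen in the paths listed in the answer. It joins everything between the first and last underscore, so a word that itself contained an underscore would still come back whole.

```python
import os


def label_from_path(path):
    """Return the word-level label encoded in an MJSynth-style image filename."""
    # Normalize Windows separators so basename behaves the same on every OS.
    filename = os.path.basename(path.replace("\\", "/"))
    parts = filename.split("_")
    if len(parts) < 3:
        raise ValueError("unexpected filename format: {}".format(filename))
    # The label sits between the leading number and the trailing "<number>.jpg".
    return "_".join(parts[1:-1])


print(label_from_path('./mjsynth/mnt/ramdisk/max/90kDICT32px\\1863/4/223_Diligently_21672.jpg'))
```

This gives word-level labels only; character-level ground truth, as discussed in the comments, needs a separate tool.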