
I'm working with millions of small images (roughly 100x100 pixels, with varying dimensions).

  1. If I store them as individual JPEG files on disk, they exceed my disk's inode limit.

  2. If I store them as raw arrays in a binary container such as HDF5, they take up >100GB even with compression (h5py's lossless gzip compression is nowhere near as compact as JPEG compression).

Is there a standard way to store these images in a single file with JPEG compression, so that they use neither many inodes nor much disk space? I'd also like to be able to read the images easily from Python.

matohak
  • "*Are there any standard ways to store these images as a single file with jpeg compression...*" Can you clarify this? You're seeking to stuff "millions of small" JPEGs into a single file? Can you elaborate on how exactly that in itself would save storage space, and how you would plan on expanding them to "*read these images easily through python*"? – esqew Oct 28 '20 at 01:52
  • Blobs in a database, perhaps? Very few (possibly only one) files needed, but some overhead would be added. – jasonharper Oct 28 '20 at 03:02
  • (Re: esqew): I can't fit millions of JPEGs onto my hard disk because it runs out of inodes. So I tried storing them in a single HDF5 file, but the total file size gets very large because I'm not aware of any JPEG compression filter for the HDF5 format. I'm looking for a solution that offers the best of both worlds: few files and a high compression ratio (it can be lossy). – matohak Oct 28 '20 at 21:52
  • (Re: jasonharper) Are there any tutorials on this? I did a quick Google search and it seems to require some knowledge of MySQL. – matohak Oct 28 '20 at 21:58
  • What's your inode limit? A million doesn't sound like much. And don't use (Re: name) but @name, then people get notified. – superb rain Oct 29 '20 at 02:23
  • @superbrain it's a remote computer shared with other people. The admin wouldn't respond to me regarding the inode limit but I've been hitting the limit and was told not to store many small files. – matohak Oct 30 '20 at 10:18
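The "blobs in a database" idea from the comments does not necessarily need MySQL. As a minimal sketch, assuming the images are already JPEG-encoded bytes, Python's built-in `sqlite3` module can keep them all in one file, which uses a single inode and needs no root access. The table name `images` and the helper names `open_store`, `put_image`, and `get_image` are hypothetical, chosen only for illustration.

```python
import sqlite3


def open_store(path):
    """Open (or create) a single-file SQLite store for JPEG blobs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def put_image(conn, name, jpeg_bytes):
    """Store already-encoded JPEG bytes under a unique name."""
    conn.execute(
        "INSERT OR REPLACE INTO images (name, data) VALUES (?, ?)",
        (name, jpeg_bytes),
    )


def get_image(conn, name):
    """Return the stored bytes for `name`, or None if it is absent."""
    row = conn.execute(
        "SELECT data FROM images WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else bytes(row[0])


if __name__ == "__main__":
    # In-memory database for the demo; pass a file path for real use.
    conn = open_store(":memory:")
    put_image(conn, "01.jpg", b"\xff\xd8\xff\xe0 placeholder bytes")
    conn.commit()  # commit in batches when inserting many images
    print(len(get_image(conn, "01.jpg")))
```

The returned bytes are the original JPEG file contents, so they can be decoded by wrapping them in `io.BytesIO` and passing that to an image library (for example Pillow's `Image.open`, assuming Pillow is installed).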

1 Answer


Ext4's bytes-per-inode

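If the images are mostly the same size, you can choose a matching bytes-per-inode ratio when creating the filesystem (the `-i` option of `mkfs.ext4`). You'll need a value smaller than the default of 16384 bytes so that enough inodes are created for the number of images you have.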
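To see why the ratio matters, here is a rough back-of-the-envelope sketch. It assumes the inode count is simply the filesystem size divided by the bytes-per-inode ratio; the real figure chosen by `mkfs.ext4` is rounded per block group, so treat these numbers as estimates. The function name `approx_inode_count` and the 1 GiB size are hypothetical and used only for illustration.

```python
GIB = 1024 ** 3


def approx_inode_count(fs_bytes, bytes_per_inode):
    """Estimate the inode count as size divided by bytes-per-inode."""
    return fs_bytes // bytes_per_inode


# A 1 GiB filesystem with the default ratio of 16384 bytes per inode
print(approx_inode_count(GIB, 16384))  # 65536

# The same filesystem with 1024 bytes per inode
print(approx_inode_count(GIB, 1024))   # 1048576
```

Note that each file also occupies at least one data block, so a ratio below the filesystem's block size may create more inodes than can actually be used.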

Loop device

If reformatting the disk is not an option, you can create a filesystem inside a single file and mount it through a "loop" device:

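# Create a 1 GiB image file (2M blocks of 512 bytes)
dd if=/dev/zero of=./single-file bs=512 count=2M
# Format it as ext4 with one inode per 1024 bytes
mkfs.ext4 -i 1024 ./single-file

# The mkdir and mount steps below need root privileges
mkdir /mnt/small-images/
mount -o loop ./single-file /mnt/small-images

mv 01.jpg /mnt/small-images/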
darw
  • I'm on a remote cluster and don't have root privileges. Is there any way to use this loop device without them? – matohak Oct 29 '20 at 22:05
  • Maybe the cluster has `fuse`, `udisks2`, or `libguestfs-tools` installed. Then you could check [How to mount an image file without root permission?](https://unix.stackexchange.com/questions/32008/how-to-mount-an-image-file-without-root-permission) – darw Oct 31 '20 at 09:02
  • Unfortunately, none of those are installed on the cluster. – matohak Oct 31 '20 at 12:05