55

I'm developing a LAMP online store, which will allow admins to upload multiple images for each item.

My concern is - right off the bat there will be 20000 items meaning roughly 60000 images.

Questions:

  1. What is the maximum number of files and/or directories on Linux?

  2. What is the usual way of handling this situation (best practice)?

My idea was to make a directory for each item, based on its unique ID, but then I'd still have 20000 directories in a main uploads directory, and it will grow indefinitely as old items won't be removed.

Thanks for any help.

Matthias Braun
CodeVirtuoso

6 Answers

91

ext[234] filesystems have a fixed maximum number of inodes; every file or directory requires one inode. You can see the current count and limits with df -i. For example, on a 15GB ext3 filesystem, created with the default settings:

Filesystem           Inodes  IUsed   IFree IUse% Mounted on
/dev/xvda           1933312 134815 1798497    7% /

There's no limit on directories in particular beyond this; keep in mind, though, that every file or directory requires at least one filesystem block (typically 4KB), even a directory with only a single item in it.

As you can see, though, 80,000 inodes is unlikely to be a problem. And with the dir_index option (which can be enabled with tune2fs), lookups in large directories aren't a big deal. However, note that many administrative tools (such as ls or rm) can have a hard time dealing with directories that contain very many files. As such, it's recommended to split your files up so that you don't have more than a few hundred to a thousand items in any given directory. An easy way to do this is to hash whatever ID you're using and use the first few hex digits as intermediate directories.

For example, say you have item ID 12345, and it hashes to 'DEADBEEF02842.......'. You might store your files under /storage/root/d/e/12345. You've now cut the number of files in each directory to 1/256th of what it would otherwise be.
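
For instance, here is a minimal PHP sketch of that scheme; the imagePath() helper, the /storage/root prefix, and the choice of md5 are just illustrative assumptions, not part of the original answer:

<?php
// Hypothetical helper: map an item ID to a hashed two-level directory.
function imagePath(int $itemId, string $root = '/storage/root'): string
{
    $hash = md5((string) $itemId);             // 32 lowercase hex digits
    // Use the first two hex digits as two single-character directory levels,
    // giving 16 * 16 = 256 buckets.
    return $root . '/' . $hash[0] . '/' . $hash[1] . '/' . $itemId;
}

$dir = imagePath(12345);                       // e.g. /storage/root/8/2/12345
if (!is_dir($dir)) {
    mkdir($dir, 0755, true);                   // recursive, so both hash levels get created
}

Because hash output is roughly uniformly distributed, each bucket ends up holding about the same number of items.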

glglgl
bdonlan
  • I know this is an old post... but after some digging I was unable to find anything decent. Is there a specific hashing method that would let you predict the possible alphanumeric characters, so you can store the files in separate folders? – Jish Jul 31 '13 at 01:23
  • @Jish I don't get you. You can use any hash function, convert its result to hex, and take the first two hex digits. Then you ideally have an equal distribution over `[0-9a-f]` for both digits. – glglgl Oct 24 '13 at 08:03
  • I just generated about 150,000 files in a directory, but ls could not list them using an ls myfile* command. Since I knew the file names, I tried opening the first and last file directly and could, so I know the files exist. – Chan Kim Oct 28 '16 at 00:50
  • "There's no limit on directories in particular beyond this" This seems to be incorrect. ext4 even has a limit, and I seem to have hit it at about 35 million files: https://www.phoronix.com/news/EXT4-Largedir-Linux-4.13 – Chris Stryczynski Jun 22 '23 at 15:26
11

If your server's filesystem has the dir_index feature turned on (see tune2fs(8) for details on checking and turning on the feature) then you can reasonably store upwards of 100,000 files in a directory before the performance degrades. (dir_index has been the default for new filesystems for most of the distributions for several years now, so it would only be an old filesystem that doesn't have the feature on by default.)

That said, adding another directory level to reduce the number of files in a directory by a factor of 16 or 256 would drastically improve the chances of things like ls * working without over-running the kernel's maximum argv size.

Typically, this is done by something like:

/a/a1111
/a/a1112
...
/b/b1111
...
/c/c6565
...

i.e., prepending a letter or digit to the path, based on some feature you can compute from the name. (Using the first two characters of the md5sum or sha1sum of the file name is one common approach, but if you have unique object IDs, then 'a' + id % 16 is an easy enough mechanism to determine which directory to use.)
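
As a PHP one-liner, that modulo variant could look like the following sketch (the /uploads prefix and the $id value are assumed for illustration):

$id     = 12345;                             // hypothetical numeric item id
// Pick one of 16 bucket directories 'a' ... 'p' from the id.
$bucket = chr(ord('a') + $id % 16);          // 12345 % 16 = 9 -> 'j'
$path   = "/uploads/$bucket/$bucket$id";     // -> /uploads/j/j12345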

sarnold
6

60000 is nothing, and neither is 20000. But you should group these 20000 somehow in order to speed up access to them, maybe in groups of 100 or 1000, by taking the item's number and dividing it by 100, 500, 1000, whatever.

E.g., I have a project where the files have numbers. I group them in thousands, so I have

id/1/1332
id/3/3256
id/12/12334
id/350/350934
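
In PHP, that grouping is just an integer division; a minimal sketch (variable names are assumed):

$id    = 12334;                  // hypothetical item number
$group = intdiv($id, 1000);      // 12334 -> 12
$path  = "id/$group/$id";        // -> id/12/12334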

You actually might hit a hard limit: some systems have 32-bit inode numbers, so you are limited to about 2^32 files per file system.

glglgl
4

In addition to the general answers (basically "don't bother that much", "tune your filesystem", and "organize your directory with subdirectories containing a few thousand files each"):

If the individual images are small (e.g. less than a few kilobytes), instead of putting them in a folder you could also put them in a database (e.g. in MySQL as a BLOB) or perhaps inside a GDBM indexed file. Then each small item won't consume an inode (and on many filesystems each inode-backed file occupies at least a few kilobytes). You could also apply a threshold (e.g. put images bigger than 4 KB in individual files, and smaller ones in a database or GDBM file). Of course, don't forget to back up your data (and define a backup strategy).
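
A rough PDO sketch of that threshold idea; the item_images table, the connection details, the form field name, and the path used for large files are all hypothetical:

<?php
// Store small images as BLOBs in MySQL, larger ones as plain files on disk.
$threshold = 4096;                             // bytes; tune to taste
$itemId    = 12345;                            // hypothetical item ID
$tmpFile   = $_FILES['image']['tmp_name'];     // uploaded file from the admin form

$pdo  = new PDO('mysql:host=localhost;dbname=shop', 'shop_user', 'secret');
$data = file_get_contents($tmpFile);

if (strlen($data) <= $threshold) {
    $stmt = $pdo->prepare('INSERT INTO item_images (item_id, image_data) VALUES (?, ?)');
    $stmt->bindValue(1, $itemId, PDO::PARAM_INT);
    $stmt->bindValue(2, $data, PDO::PARAM_LOB);
    $stmt->execute();
} else {
    // Big image: keep it as a regular file (e.g. under a hashed directory layout as above).
    move_uploaded_file($tmpFile, "/storage/items/$itemId/" . basename($tmpFile));
}

Serving images out of the database does add a round trip through PHP on every request, which is the trade-off the comment below describes.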

Basile Starynkevitch
  • This is a good mechanism for reducing disk use, but it prevents zero-copy mechanisms such as `sendfile(2)` from transferring files without further server software intervention. – sarnold Nov 24 '11 at 01:23
0

The year is 2014; I come back in time to add this answer. Lots of big/small files? You can use Amazon S3 or Ceph-based alternatives like DreamObjects, where there are no directory limits to worry about.

I hope this helps someone decide among all the alternatives.

2023 update: 60,000 is not a lot of files. There is a limit to the number of files that Linux can effectively deal with, and it comes down to the issues you hit when scanning directories (copy, move, rename), which can be overcome by clever use of find and argument trickery. The way programmers have dealt with these limitations is to use subdirectories and limit the number of files per directory; you will see this in WordPress file uploads. For a lot of files (10,000+), object storage is still the best. But I have managed image hosting providers that stored 100k-plus files using the old ext3/4 filesystem and the directory hack.

You may also encounter issues with ulimit, but once you have hit it enough times and seen it in the logs, you can raise the ulimit as high as you want.

I do agree with the comment that the cloud adds some new problems, but you have to use it correctly for the right type of job instead of dismissing it.

Abhishek Dujari
  • Ah, the irony... I find myself reading this thread specifically because I have downloaded 2 months' worth of AWS CloudTrail logs for lack of any better way to consume them. There seem to be about 300 JSON files per day, multiplied by 60 days: I have about 18,000 files, and I dumped them all in the same directory. Moral of the story: the year is 2014 and magical cloud services create a bunch of new problems to replace the ones they solved. – David Apr 23 '14 at 04:11
  • You can use other CDN providers who can provide logs in W3C format. I found a bunch of sample code and combined it to generate what I need, then passed it to AWStats, for example, to get my stats. Any coder who is half serious can achieve this. Suffice it to say, object storage is not a silver bullet, but for the problem mentioned above it is a good solution in 2014. – Abhishek Dujari Oct 15 '14 at 00:52
-3
// Hash the id and build nested directories from 3-hex-digit chunks of the hash:
$hash = strtoupper(md5($id));                                              // e.g. 0123456789ABCDEF...
$file_path = 'items/' . implode('/', str_split(substr($hash, 0, 16), 3)) . '.jpg';
// e.g. items/012/345/678/9AB/CDE/F.jpg

// 1 node = 16^3 = 4096 subnodes (fast)
Thomas
gibz