31

Currently, I have images (max. 6 MB each) stored as BLOBs in an InnoDB table. As the data grows, the nightly backup keeps getting slower and is starting to hinder normal performance.

So the binary data needs to move to the file system; pointers to the files will be kept in the DB.

The data has a tree-like structure:

- main site
  - user_0
    - album_0
    - album_1
    - album_n
  - user_1
  - user_n
etc...

Now I want the data to be distributed evenly through the directory structure. How should I accomplish this?

I guess I could compute MD5('userId, albumId, imageId') and slice up the resulting string to get my directory path:

  /var/imageStorage/f/347e/013b/c042/51cf/985f7ad0daa987d.jpeg

This would allow me to map the first character to a server and evenly distribute the directory structure over multiple servers.
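A minimal Python sketch of that slicing, for illustration only. The function names, the comma-separated key format, and the modulo-based server mapping are assumptions; the slice widths (1, 4, 4, 4, 4, then the remaining 15 characters) follow the example path above.

```python
import hashlib


def hashed_image_path(user_id, album_id, image_id,
                      root="/var/imageStorage", ext="jpeg"):
    """Build a storage path from the MD5 of the three IDs (hypothetical helper)."""
    # Assumed key format: the three IDs joined by commas.
    key = f"{user_id},{album_id},{image_id}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()  # 32 hex characters
    # Slice widths mirror the example: f / 347e / 013b / c042 / 51cf / rest
    dirs = [digest[0], digest[1:5], digest[5:9], digest[9:13], digest[13:17]]
    filename = f"{digest[17:]}.{ext}"
    return "/".join([root, *dirs, filename])


def pick_server(digest, servers):
    """Map the first hex character of a digest to one of the servers (assumed scheme)."""
    return servers[int(digest[0], 16) % len(servers)]
```

Because MD5 output is close to uniformly distributed, each of the 16 possible first characters should receive roughly the same share of files.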

This would, however, not keep images organised per user, and would likely spread the images of a single album over multiple servers.

My question is:
What is the best way to store the image data in the file system in a balanced way, while keeping user/album data together?

Am I thinking in the right direction, or is this the wrong way of doing things altogether?

Update:
I will go with slicing the md5(user_id) string for the split at the highest level, and then put all of a user's data in that same bucket. This will ensure an even distribution of data while keeping each user's data stored close together.

  /var
   - imageStorage
     - f/347e/013b
       - f347e013bc04251cf985f7ad0daa987d
         - 0
           - album1_10
             - picture_1.jpeg
         - 1
           - album1_1
             - picture_2.jpeg
             - picture_3.jpeg
           - album1_11
             - picture_n.jpeg
         - n
           - album1_n

I think I will use the albumId split from the end (I like that idea!) to keep the number of albums per directory smaller (although it won't be necessary for most users).
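A rough Python sketch of the layout described in this update. The helper names and the `album_<id>` directory naming are hypothetical; the bucket slices (1, 4, 4 characters, then the full hash) and the use of the last digit of the album ID follow the tree above.

```python
import hashlib


def user_bucket(user_id, root="/var/imageStorage"):
    """Directory holding all data of one user, keyed by md5(user_id)."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    # e.g. f/347e/013b/f347e013bc04251cf985f7ad0daa987d
    return "/".join([root, digest[0], digest[1:5], digest[5:9], digest])


def album_dir(user_id, album_id, root="/var/imageStorage"):
    """Album directory, sharded by the last digit of the album ID."""
    shard = str(album_id)[-1]  # the album ID "split from the end"
    # The album_<id> naming is an assumption made for this sketch.
    return f"{user_bucket(user_id, root)}/{shard}/album_{album_id}"
```

With this scheme albums 1 and 11 of the same user land in shard `1` and album 10 lands in shard `0`, and everything belonging to one user stays under a single bucket directory.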

Thanks!

Can Vural
Jacco
  • Ah - I'd suggest editing "nicely distribute" to "evenly distribute". I now realize that your goal is to try and average out the number of pictures per file system folder. – J c Oct 10 '08 at 15:50
  • Have you considered doing incremental backups of the DB? – Tahir Akhtar Oct 10 '08 at 15:31
  • 1
    I think that incremental backups would only temporarily solve the problem. – Jacco Oct 10 '08 at 17:22

3 Answers

23

Just split your user ID from the end, e.g.:

UserID = 6435624 
Path = /images/24/56/6435624
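As a minimal sketch in Python, the mapping above could look like this. The function name and the zero-padding of short IDs to four digits are assumptions, not part of the original suggestion.

```python
def user_dir(user_id, root="/images", width=4):
    """Derive a directory from the trailing digits of a numeric user ID."""
    # Pad short IDs so that two two-digit levels always exist (assumption).
    padded = str(user_id).zfill(width)
    level1 = padded[-2:]    # last two digits: the fastest-changing ones
    level2 = padded[-4:-2]  # the two digits before those
    return f"{root}/{level1}/{level2}/{user_id}"
```

For the example above, `user_dir(6435624)` yields `/images/24/56/6435624`.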

As for the backup, you could use MySQL replication and back up the slave database to avoid problems (e.g. locks) during the backup.

Node
  • 1
    Yep, that's what I was going to say. Reverse the digits in the numeric ID and it's more likely to distribute evenly, kind of round-robin. – Bill Karwin Oct 10 '08 at 16:13
  • @Bill: I don't get it. Why is reversing the userid more likely to distribute evenly? Is it because older users have had more time to upload more images? – Alix Axel Apr 19 '10 at 03:14
  • 4
    @Alix: Suppose 75 user IDs are allocated in a monotonically increasing manner. The 1's digit cycles through 0 through 9, and repeats. On average, there are an equal number of occurrences of each digit. The 10's digit cycles too, but only 0 through 7; it never gets to 8 or 9. Also the 100's digit is just 0 -- no distribution at all. So it's better to use the lower digits of the user ID as the index for the higher-level directories. – Bill Karwin Apr 19 '10 at 07:14
  • @Bill: Of course, makes perfect sense! Thanks for explaining it to me. =) – Alix Axel Apr 19 '10 at 07:20
  • @Bill, @Node: If the filename is hashed, should the directory structure still be derived from the un-hashed ID or is it better to apply the same strategy against the hashed value? – Wil Moore III Aug 26 '10 at 19:05
  • 1
    @wilmoore: Depends on which hash algorithm you use, but probably you're using md5 or something, where any digit is as likely to be evenly distributed as another digit. So in that case there's no advantage to choosing the rightmost digits for your toplevel directories. You're just as likely to distribute files evenly by choosing the leftmost digits of the hash string. – Bill Karwin Aug 26 '10 at 19:36
  • 2
    What if the user's ID is small (such as 5 or 19)? Where would you store the images? – cherouvim Dec 29 '10 at 14:42
  • @cherouvim: Reverse it and then zerofill the ID up to 4 characters - `/00/05/0005` or `/00/91/0019` for instance? – Alix Axel Oct 25 '12 at 14:25
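
The digit-distribution argument from the comments above can be checked with a few lines of Python. This is only an illustration; it assumes the 75 user IDs are 0 through 74, and `digit_counts` is a hypothetical helper.

```python
from collections import Counter


def digit_counts(ids, place):
    """Count how often each decimal digit occurs at the given place (1, 10, 100, ...)."""
    return Counter((i // place) % 10 for i in ids)


ids = range(75)                 # assumed: IDs 0..74, allocated sequentially
ones = digit_counts(ids, 1)     # all ten digits occur, 7 or 8 times each
tens = digit_counts(ids, 10)    # only digits 0..7 occur; 8 and 9 never do
hundreds = digit_counts(ids, 100)  # only digit 0 occurs
```

The lowest digit spreads the IDs over all ten buckets almost evenly, while the higher digits leave buckets empty, which is why the trailing digits make better top-level directories.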
7

One thing about distributing the files into different directories: if you split your MD5 filenames into subdirectories (which is generally a good idea), I would suggest keeping the complete hash as the filename and duplicating its first few characters as directory names. This makes it easier to identify files, e.g. when you have to move directories.

e.g.

abcdefgh.jpg -> a/ab/abc/abcdefgh.jpg
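
A short Python sketch of this mapping. The function name and the `depth` parameter are hypothetical; the nesting of growing prefixes follows the example above.

```python
def prefixed_path(filename, depth=3):
    """Nest a file under directories named after growing prefixes of its name."""
    stem = filename.rsplit(".", 1)[0]  # name without the extension
    # Prefixes of length 1, 2, ..., depth: "a", "ab", "abc"
    dirs = [stem[:i] for i in range(1, depth + 1)]
    # The full name is kept as the filename, so a file stays
    # identifiable even after it has been moved out of its directory.
    return "/".join(dirs + [filename])
```

`prefixed_path("abcdefgh.jpg")` returns `a/ab/abc/abcdefgh.jpg`.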

If your filenames are not evenly distributed (i.e. not a hash), try to choose a splitting method that gives an even distribution, e.g. use the last characters if the name is an incrementing user ID.

Alex Lehmann
3

I'm using this strategy, given a unique picture ID:

  • reverse the string
  • prepend a zero if the string has an odd number of digits
  • chunk the string into two-digit substrings
  • build the path as below

    17 >> 71 >> /71.jpg
    163 >> 0361 >> /03/61.jpg
    6978 >> 8796 >> /87/96.jpg    
    1687941 >> 01497861 >> /01/49/78/61.jpg
    

This method ensures that each folder contains up to 100 pictures and 100 sub-folders and the load is evenly distributed between the left-most folders.
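
The steps above can be sketched in Python as follows. The function name and the `ext` parameter are assumptions made for this illustration.

```python
def picture_path(picture_id, ext="jpg"):
    """Map a numeric picture ID to a path of two-digit chunks."""
    digits = str(picture_id)[::-1]   # reverse the string
    if len(digits) % 2:              # odd number of digits:
        digits = "0" + digits        # prepend a zero
    # Chunk into two-digit substrings: "01497861" -> 01, 49, 78, 61
    chunks = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    return "/" + "/".join(chunks) + "." + ext
```

This reproduces the four examples above, e.g. `picture_path(1687941)` gives `/01/49/78/61.jpg`. One caveat when the rule is applied literally: an ID ending in zero can map to the same path as a shorter ID (3 and 30 both give `/03.jpg`), so such IDs would need separate handling.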

Moreover, you only need the ID of the picture to reach the file; there is no need to read the picture table containing the other metadata. On the other hand, a user's data are not stored close together, and the ID-to-path relation is predictable. Whether that matters depends on your needs.

fustaki