
I'm developing an online file storage service, mainly in PHP and MySQL, where users will be able to upload files of up to 10 to 20 GB in size.

Unregistered users will be able to upload files, but not to a personal storage space; their uploads all go into a single directory shared by every unregistered user.

Registered users will get a fixed amount (that might increase in the future) of personal storage space and access to a file manager to easily manage and organize all their files. They'll also be able to set their files private (not downloadable by anyone but themselves) or public.


What would be a good possible directory set-up?

I'm thinking about a "personal" directory that will contain folders with the user's id as the folder name for each registered user.

Alongside the personal directory, there will be an "other" folder which will just contain every file that's been uploaded by unregistered users.

Both will contain the uploaded files, each named after its corresponding row id (from the files table in the database).

ROOT
  FOLDER uploads
    FOLDER personal
      FOLDER 1
        FILE file_id1
        FILE file_id2
             (...)
      FOLDER 2
        FILE file_id3
        FILE file_id4
             (...)
        (...)
    FOLDER other
      FILE file_id5
      FILE file_id6
           (...)

This is the first time I'm dealing with a situation like this, and this concept is all I could come up with so far. Any suggestions are welcome!

Kid Diamond
  • I support the metadata approach. Here are some Stack Exchange answers that provide additional information and pointers: http://dba.stackexchange.com/questions/35287/storing-metadata-of-various-data-types-in-a-mysql-database and http://stackoverflow.com/questions/9558603/easy-way-to-store-metadata-about-mysql-database – Adrian Aug 08 '14 at 10:18

2 Answers


Basically you need to address the following topics:

  1. Security: From what you described, it is unclear who is allowed read access to the files. If it is always "everybody can read everything", you can set up the file structure inside a web server's virtual host. Otherwise, set up the folder structure in a "hidden" area and access it only via server-side scripts (e.g. copy on demand). The secure approach consumes more resources, but leaves room for a technically optimized folder structure.

  2. OS constraints: Each OS limits the number of items and/or files per folder. The actual limits depend on the OS-specific configuration of the file system. If I remember correctly, there are Linux setups that support 32000 items per folder. The exact figure is not important; what matters is that your planned utilization does not exceed the limits of your servers. If you plan to serve 10 users, a single folder "other" is likely enough; if you target a million users, you will probably need many "other" folders. If you also do not want to restrict the number of files your users can upload, you probably need a way to extend the folder per user. Personally, I apply a policy of never having more than 1000 items in a folder.

  3. SEO requirements: If your service needs to be SEO compliant, it must be able to present descriptive names to users, ideally without general categories such as "Personal"/"Other". Your proposed structure may meet this requirement. However, the OS constraints may force you into a more technical physical structure (e.g. one where you chunk the item ID into groups of 3 digits and use those to build your folder and file structure). On top of that you can implement a logical structure that converts IDs into names. Such an implementation means file access via server-side scripts, and therefore demands more resources. Alternatively, you could experiment with web server URL rewrites...

  4. Consistency + Availability + Partition tolerance: Turning your project into a real service likely requires a setup that balances these three. Separating the system into a physical and a logical layer helps a lot here. Consistency, availability and partition tolerance would be dealt with at the logical layer. NoSQL (http://en.wikipedia.org/wiki/NoSQL) might be your way forward; see http://en.wikipedia.org/wiki/CAP_theorem for details on the topic.
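
As a rough illustration of the ID chunking mentioned in point 3 (combined with the 1000-items policy from point 2), here is a minimal Python sketch. The question uses PHP, but the idea carries over directly. The function name `sharded_path`, the `uploads` root and the 12-digit zero padding are assumptions made for this example only, not something the question prescribes.

```python
from pathlib import PurePosixPath


def sharded_path(file_id: int, root: str = "uploads",
                 width: int = 12, chunk: int = 3) -> PurePosixPath:
    """Map a numeric file ID to a nested path built from 3-digit chunks.

    Hypothetical helper: the ID is zero-padded to `width` digits and split
    into groups of `chunk` digits. Every group except the last becomes a
    directory, and the file itself keeps the plain ID as its name.
    """
    if file_id < 0:
        raise ValueError("file_id must be non-negative")
    if width % chunk != 0:
        raise ValueError("width must be a multiple of chunk")
    digits = str(file_id).zfill(width)
    if len(digits) > width:
        raise ValueError("file_id does not fit into the configured width")
    parts = [digits[i:i + chunk] for i in range(0, width, chunk)]
    # The last chunk only distinguishes files inside the same leaf folder,
    # so each folder holds at most 10**chunk files and 10**chunk subfolders.
    return PurePosixPath(root, *parts[:-1], str(file_id))


print(sharded_path(1234567))  # uploads/000/001/234/1234567
print(sharded_path(1))        # uploads/000/000/000/1
```

With 3-digit chunks no directory ever holds more than 1000 entries, and the path can be recomputed from the ID alone, so nothing extra has to be stored in the database.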

====================== UPDATE

From the comments we now know that you store metadata in a relational database, that you have a physical layer (files on disk) and a logical layer (access via PHP scripts), and that you base your physical file/folder layer on IDs.

This opens room to fully move any structural considerations to the relational database and maybe to improve the physical layer from the very beginning. So here are the tables of the sql database I would create:

 ======
 users
 ======
 id (unsigned INT, primary key)
 username
 password
 isregisteredflag
 ...any other not relevant for the topic...

 ======
 files
 ======     
 id (unsigned INT,primary key)
 filename
 _userid (foreign key to users.id)
 createddate
 fileattributes
 ...any other not relevant for the topic...

 ======
 tag2file
 ======
 _fileid (foreign key to files.id)
 _tagid (foreign key to tags.id)

 ======
 tags
 ======
 id  (unsigned INT,primary key)
 tagname

Since this structure lets you derive files from user IDs and user IDs from files, you do not need to store that relation as part of your folder structure. You just name the files on the physical layer after files.id, a numeric value generated by the database. Since the ID is generated by the database, it is guaranteed to be unique. You can also now have tags, which give your users a richer way to categorize their files (if you do not like tags, you could do folders instead, also in the database).
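
To make the table outline above a bit more concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for MySQL. The column types, the sample rows and the helper names (`add_file`, `files_of_user`, `owner_of_file`, `files_with_tag`) are assumptions for illustration; only the table and column names come from the outline, and columns not relevant here are left out.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE users (
    id               INTEGER PRIMARY KEY,
    username         TEXT NOT NULL,
    isregisteredflag INTEGER NOT NULL DEFAULT 1
);
CREATE TABLE files (
    id       INTEGER PRIMARY KEY,
    filename TEXT NOT NULL,
    _userid  INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE tags (
    id      INTEGER PRIMARY KEY,
    tagname TEXT NOT NULL UNIQUE
);
CREATE TABLE tag2file (
    _fileid INTEGER NOT NULL REFERENCES files(id),
    _tagid  INTEGER NOT NULL REFERENCES tags(id),
    PRIMARY KEY (_fileid, _tagid)
);
""")


def add_file(conn, user_id, filename):
    """Insert a file row; the generated id doubles as the name on disk."""
    cur = conn.execute(
        "INSERT INTO files (filename, _userid) VALUES (?, ?)",
        (filename, user_id),
    )
    return cur.lastrowid


def files_of_user(conn, user_id):
    """Derive a user's files from the database, not from a folder name."""
    rows = conn.execute(
        "SELECT id, filename FROM files WHERE _userid = ? ORDER BY id",
        (user_id,),
    )
    return rows.fetchall()


def owner_of_file(conn, file_id):
    """Derive the owning user from a file id (None if the id is unknown)."""
    row = conn.execute(
        "SELECT _userid FROM files WHERE id = ?", (file_id,)
    ).fetchone()
    return row[0] if row else None


def files_with_tag(conn, tagname):
    """List files carrying a tag via the tag2file link table."""
    rows = conn.execute(
        "SELECT f.id, f.filename FROM files f "
        "JOIN tag2file tf ON tf._fileid = f.id "
        "JOIN tags t ON t.id = tf._tagid "
        "WHERE t.tagname = ? ORDER BY f.id",
        (tagname,),
    )
    return rows.fetchall()


# Hypothetical sample data: user 1 stands for unregistered uploads.
conn.executemany(
    "INSERT INTO users (id, username, isregisteredflag) VALUES (?, ?, ?)",
    [(1, "public", 0), (2, "alice", 1)],
)
first = add_file(conn, 2, "holiday.zip")
second = add_file(conn, 2, "notes.txt")
anon = add_file(conn, 1, "anon.bin")
conn.execute("INSERT INTO tags (id, tagname) VALUES (1, 'backup')")
conn.execute("INSERT INTO tag2file (_fileid, _tagid) VALUES (?, 1)", (first,))
```

Both lookup directions (`files_of_user` and `owner_of_file`) are plain queries on `files._userid`, which is why the folder structure no longer has to encode the owner.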

Taking care of point 4 has a big impact on your design. If you only take care of it after you have set the whole thing up, you potentially double your effort. Since everything is already set up to build files from numeric IDs, it is a very small step to store your physical files in a key-value store in a NoSQL database (rather than on the file system), which makes your system scalable as hell. This would mean employing an SQL database for metadata and structure data and a NoSQL database for file contents.

Btw., to cover your public files I would assume you have a user "public" with ID=1. This results in some hardcoded data, which is generally considered ugly. However, since the "public" functionality is such a central element of your application, you can turn that unwritten rule into a written one by documenting it properly. Alternatively, you can add some more tables and blow up your code to cover the two different cases in a 'clean' way.

Quicker
  • The uploaded files of unregistered users will be publicly accessible to anyone who's got the link. Those of registered users will be private by default, but they will have the option to publish folders/files. I'll use PHP to secure this. I'm running Linux with [ext4](http://stackoverflow.com/a/466596/1115367), so the number of files per directory is unlimited. I will be using MySQL to store the file metadata. As for creating human-friendly URLs, I will have my URLs rewritten so that the folder and file names are used instead of a bunch of (to the user) meaningless ids. – Kid Diamond Aug 06 '14 at 08:05
  • In that case I do not get why you plan to have extra folder layers under "folder uploads". I would assume that you can assign files to users via your MySQL db. So folder 1, folder 2 etc. can be used generically for grouping to keep the whole thing extendable (but not for user assignment), as a placeholder and insurance for the future. That is ok. But I do not see the extra value of having the folders personal and other. Also, what are you doing about point 4 (consistency, avail., partition tolerance)? – Quicker Aug 06 '14 at 12:35
  • The current folder structure is just to keep the data organized. Every user has a folder which contains all physical data of *only* that user. So folder `1` belongs to user id `1`, etc. The folder `personal` is only meant for signed-up users, and the folder `other` is meant for visitors with no account who upload files. As for your point 4, I'm not sure what I'm supposed to take from that. – Kid Diamond Aug 06 '14 at 13:22
  • Your folder structure adds addressing effort (addressing file xyz requires you to look up the user before you can actually access the file), but does not seem to pay back any benefit. This seems to be a minor issue, but extra effort adds up. For point 4: you said you are setting up a service. I would expect your customers to expect you to serve files no matter what incidents happen on your side. Incidents that have happened to others are e.g. disk crashes. You can address those with redundancy, which adds management demands. NoSQL might be a management concept and tool that helps here. – Quicker Aug 06 '14 at 13:42
  • I could actually just make one folder called `Uploads` with no sub-directories and store *every* file there. But then, in addition to the file id, I would have to add a user identifier to the file names in a format like `[user_id]_[file_id]`. So file 1 from user 1 would be named `1_1`, file 2 from the same user `1_2`, etc. And if the file belongs to a visitor, the user id would be set to 0, like `0_[file_id]`. – Kid Diamond Aug 06 '14 at 14:45
  • All those IDs correspond to the row ID in the database. So file `234_24` would translate to File ID 24 from User ID 234. I don't know if you have a better approach to this. I could make the file names just be an integer incremented by 1. But that would mean I'd have to add an *additional* column in my table to track those names. As for point 4, I think that goes a little beyond the scope of my post. I will deal with that later, after I've finished the majority of the application. Currently I'm only concerned about getting the database and directory structure right. – Kid Diamond Aug 06 '14 at 14:47
  • You can actually view my entire database schema [here](http://i.stack.imgur.com/Xb3qw.png) (missing column `metadata` though). And for files with no user, I just check if there is no related record in the `user_files` table. If there is, it's a user file, if there isn't it's a 'public' file. Also the thing is that I only know MySQL and have no clue about NoSQL and how hard it would be to master. I've [decided](http://stackoverflow.com/questions/3748/storing-images-in-db-yea-or-nay) to store the physical files on my file system and that storing them directly in the database is a no go for me. – Kid Diamond Aug 07 '14 at 07:58
  • Your way of finding 'public' files works as well. The benefit of explicit flagging is robustness. What if, in your solution, users are deleted by accident, but for some reason the related files are not? Once you store your data in NoSQL you gain some flexibility. You can add and drop nodes at any time, and you can make it replicate data in a 'push button' fashion, although I must admit you have to know what you are doing. However, I guess you had to learn SQL as well before you drew your schema... – Quicker Aug 07 '14 at 08:05
  • BTW, the "decided" link you provided is a discussion about storing file data in relational databases. I absolutely agree that you should not do that. What I am saying is that it may be beneficial to store the data in a NoSQL database, as there are NoSQL databases that are heavily optimized for serving file data, even better than any OS file system. – Quicker Aug 07 '14 at 08:12
  • Anyway, I have to leave this topic. Your question is definitely answered. I wish you good luck. – Quicker Aug 07 '14 at 08:13
  • I would probably have to master SQL very well indeed. But I think MySQL will do the job just fine for now; I can always decide to go with another database later (even though that's going to take more effort). As for (rare) situations like accidental user deletion while the user's files are still there, that's where my directory structure comes in handy; I can just look up the folder with the user id and act accordingly. I will probably accept your answer, but I'll wait to see if someone else has some suggestions I haven't thought of before. Thanks for your help! – Kid Diamond Aug 07 '14 at 08:15

In my opinion, it shouldn't actually matter which folder structure you have. Of course (as already mentioned), there are OS and FS restrictions, and you may want to spend a thought or two on scaling.

But in the end, I would recommend a more flexible approach to storage and retrieval:

  • Ok, files are physically stored somewhere in a file system.
  • But: There should be a database with meta information about the file like categories, tags, descriptions, modification dates, maybe even change revisions. Of course, it will also store the physical position of the file, which may or may not be on the same machine.
  • This database would be optimized for searching by those criteria. (There are a couple of libraries for semantical indexing/searching, depending on your language/framework.)

This way, you would separate the physical concerns from the logical/semantic ones. And if you or your users still want the hierarchical approach, you can always go with the category logic.
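
A minimal in-memory sketch of that separation might look like the following Python snippet. In practice this would be database tables rather than dictionaries, and the names `FileRecord`, `MetadataIndex`, `host` and `path` are purely hypothetical, chosen only to show that the logical view (tags) and the physical location are looked up independently.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Metadata for one stored file (hypothetical field names)."""
    file_id: int
    display_name: str
    host: str   # machine holding the bytes; need not be the web server
    path: str   # physical location on that machine
    tags: set = field(default_factory=set)


class MetadataIndex:
    """Toy stand-in for the metadata database described above."""

    def __init__(self):
        self._records = {}
        self._by_tag = {}

    def add(self, record):
        """Register a file and index it under each of its tags."""
        self._records[record.file_id] = record
        for tag in record.tags:
            self._by_tag.setdefault(tag, set()).add(record.file_id)

    def locate(self, file_id):
        """Physical concern: where do the bytes live?"""
        rec = self._records[file_id]
        return rec.host, rec.path

    def find_by_tag(self, tag):
        """Logical concern: which files belong to this category?"""
        return sorted(self._by_tag.get(tag, set()))


index = MetadataIndex()
index.add(FileRecord(34, "report.pdf", "storage-a", "/data/34", {"work"}))
index.add(FileRecord(35, "beach.jpg", "storage-b", "/data/35", {"holiday"}))
index.add(FileRecord(36, "slides.odp", "storage-a", "/data/36", {"work"}))
```

Moving a file to another machine then only changes `host` and `path` in its record; the tags and names that users see stay the same.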

Finally, you will have a much more flexible and appealing file hosting service.

lxg
  • That was my intention from the start; storing the meta data in the database and physical file location. – Kid Diamond Aug 06 '14 at 21:46
  • Yes, but then IMO the folder structure is irrelevant. Files are just somewhere, and users decide in which semantical taxonomy they are presented. – lxg Aug 06 '14 at 23:28
  • Currently, in my case, they're not irrelevant. Each folder or file name mirrors its row id in the database. This way I don't have to track an additional field in my tables. Every row id in my files table is also the name of a file, and the same goes for every user with files. So for files table row id `34`, the physical file is also named `34`. For file `10` of user `465`, the physical file `10` would be contained in folder `personal/465/`. Files that do not have a user (files uploaded by unregistered visitors) will be stored in the `other/` folder. – Kid Diamond Aug 07 '14 at 07:12