1

The naive way would be to put the whole path into the DB as a string and it would work for toy DBs. However, this approach have a couple of flaws. For example, say I have 100K files under /var/www/sites/ then storing /var/www/sites 100K times in the DB is very inefficient. I am sure there is a much better way to do this.

I would like to index only the file paths on a DVD and later search for mp3 files or for directories, etc. The preferred RDBMS is SQLite (perhaps FTS Tables?). My goal is to learn, I know there are a bunch of desktop search engines for this.

Ali
  • 56,466
  • 29
  • 168
  • 265
  • Storing a file path or URL in a database is not a violation of Normal Forms! I'd go ahead and do it if it makes sense for your needs. – nvogel Jan 30 '11 at 15:16
  • My goal is to learn, this problem can be solved with a bunch of desktop search engines. I highly doubt that repeating data in a DB is the way to go. – Ali Jan 30 '11 at 15:35
  • 1
    If you have no further questions, could you please vote, and choose an answer. – PerformanceDBA Feb 02 '11 at 16:12
  • @PerformanceDBA I have not abandoned this question, I just do not have the time at the moment to continue this discussion. Please give me some time, I am interested in getting an answer. BTW, thanks for your detailed answers so far. – Ali Feb 02 '11 at 18:09

4 Answers4

5

The naive way would be to put the whole path into the DB as a string and it would work for toy DBs. However this approach yields a non-normalized DB.

Who told you that ? That is the most ridiculous thing I have heard in a long time. Get rid of them as soon as you can, and do not pay them for such absurd "advice".

Short Answer

That's as absurd as saying, if you store phone numbers or addresses in the database in their raw form, it is naïve and not normalised.

Put your URLs in a single column in the database (high end or low end). It does not break Normalisation rules. (Assuming of course that the database is normalised in other respects.)

Long Answer

Let's look at two counterpoints.

Some people do not understand that Normalisation is a Principle. Sure, in the application of that principle in databases, we have Normal Forms, and you either comply with or break Normal Forms. But that isn't the whole principle. You could just as easily have a shocking database because it is not Normalised, even though it can be in 3NF.

Let's say you have a Customer table that has a set of columns that make up "address". And a Supplier table that also has the same (hopefully exactly the same) columns that make up "address". As long as the Functional Dependencies have been resolved, that is correct, there is nothing is the Normal Forms that will identify that it does not meet 3NF or 5NF. Such a database will be fine. But a good designer (as opposed to a qualified but inexperienced one) will Normalise the "address" columns into a separate Address table, and place an FK to it in the Customer and Supplier tables. That designer gives you a more Normalised database, that is even easier to maintain, but it is still in the same 3NF or 5NF as before.

For the newbie Normalisator, they need to Normalise everything. They forget the purpose of the database, and Normalise to a degree that is beyond its purpose. By the same reasoning of the person who told you that, the "address" columns and contents of those columns are "not Normalised". As long as you have Washington St, Washington Blvd, Washington Lane, holy moley, "that is naïve and the database is not normalised". Absolute nonsense.

For the purpose of most databases, storing the street name and the street type in a single column is quite adequate. And if you had a good designer, sure they would implement a separate Address table. The multiple occurrences of "Washington" in the street names cannot be said to be "duplicates". But if you were the city council or the electricity utility, you would have a different purpose, in which case that would not be good enough, and yes, there you would Normalise the "address" column group to the nth degree, such the "Washington" or "Street" never repeats as a data value. And for that you need a very experienced designer. Only true for a small minority with a different purpose.

Therefore, if your database has the purpose of carefully analysing the entire content of the URLS, and reconstructing a tree or explorer style view, then by all means, build a directory structure in tables, which allows storage of each component of the URL, and a hierarchy, and never duplicate any component. But if your purpose is simply to store URLS like most people store addresses or phone numbers, then just store raw URLs like addresses or phone numbers. You can perform quite reasonable searches and match on component parts of the raw URL, to find MP3 files or whatever.

There is no "best" without a measurement of what the criteria is. There is no one-size-fits-all. The electricity utility database is "too complex" (too Normalised) for most purposes; the usual database is "inadequate" for the electricity utility. If you identify the purpose, the types of searches you require, that identifies the criteria against which "best" or "better" or "fails" can be measured.

Response to Comments

Your Edit has changed the landscape. While the usual level of Normalisation may be the adequate level for most people (it is therefore not "naïve"), you need something more, you are closer to the electricity utility, you need a Normalised Directory structure to store URLS or full Paths, and you need to remove duplication from the data values. Eg. /var, /www, /sites, etc stored once.

Normalised Directory

No problem. That too, has been done many times. I have posted the exact requirement in another answer.

Rest assured that that exact structure runs in two large Enterprise class servers, and that generic structure runs in virtually every SQL database I have written over more than 25 years. It may look complex, but once you get your head around it, it is simple and flexible. Allows full recursion, etc.

You can ask questions in the comments here.

halfer
  • 19,824
  • 17
  • 99
  • 186
PerformanceDBA
  • 32,198
  • 10
  • 64
  • 90
  • "build a directory structure in tables, which allows storage of each component of the URL, and a hierarchy, and never duplicate any component" My question is: how to do that? – Ali Jan 30 '11 at 17:17
  • Hi @PerformanceDBA. I have noticed that some of your answers contain commentary on meta-issues such as voting, the quality of other questions, the competence of readers or question authors, etc. These things are broadly not germane to answers, so it is probably best to avoid them. It is true that not everyone is greatly talented, but readers know that already, and most readers are not here for that sort of rumination anyway `:-)`. – halfer Mar 08 '19 at 18:42
2

However this approach yields a non-normalized DB.

So what? 3rd NF isn't holy. Some forms of denormalization result in easier to understand datamodels. As long as the duplication doesn't cause problems in terms of database size or CPU-load while transforming/parsing the unnormalized values, I wouldn't worry about it.

Confusion
  • 16,256
  • 8
  • 46
  • 71
1

This problem has a name: storing hierarchical data and has well-estabilished solutions
Managing Hierarchical Data in MySQL
One more Nested Intervals vs. Adjacency List comparison
How to store directory / hierarchy / tree structure in the database?
Optimized SQL for tree structures

Community
  • 1
  • 1
Ali
  • 56,466
  • 29
  • 168
  • 265
0

I think a generally sensible approach is to remove any commonality from the stored paths. For example, if you only cared about items within /media/ then there's obviously not point in storing the /media/ portion of the path, as you can trivially add this when you need to re-constitute the full path.

John Parker
  • 54,048
  • 11
  • 129
  • 129