7

I've been researching NoSQL databases, and a common theme that comes up is that relational databases are unsuitable for storing unstructured data. For example:

Unfortunately, the rigidly defined, schema-based approach used by relational databases... is a poor fit for unstructured and semi-structured data [source]

I'm having a hard time understanding why this is. For example, if I wanted to store an image or some raw text in a relational database, could I not just store it as a text type (e.g. in a single column table or a key-value table)?

user3187713
  • 71
  • 1
  • 1
  • 2
  • 5
    Unstructured data is not an image or text file. It is a collection of data where one record does not look like another. Structured data assumes common fields between records, adding an image field or text field is OK, it is still just a field. It becomes problematic to search the text, but is doable... Unstructured would be a series of verbatim text answers to a question for example, where you'd want to search common pattern (how many people responded positively). That type of searching isn't SQL's strength – Sparky Jan 12 '14 at 17:37

3 Answers3

20

My favorite example of unstructured data which is not a good fit for a relational database is the computer hardware parts database.

Imagine you have a web shop which sells computer hardware. How would your product database look?

Every product has a name, a price and a vendor. But CPUs have a clock rate, a cache size and a # of cores, monitors have a size and resolution, RAM modules have a capacity and hard drives have also a capacity (which can not be compared to that of RAM modules).

How would you store this data in a relational database?

  • You could create a very wide table with hundreds of field for any possible attribute some product could have, but for most product most of these fields will be NULL.
  • You could have a separate table for each product category
  • You could have a huge table with the columns product, property and value which maps all the properties to the values (but what type do you use for value when some properties are numeric and others aren't?)

All three options are valid, but none of them is really satisfying.

But when you have a document-oriented database without a strict schema, it becomes a lot simpler because each entry can have any set of attributes which can have values of any type.

Philipp
  • 67,764
  • 9
  • 118
  • 153
  • Readers might also want to check out an interesting take on the problem of storing unstructured data in a relational database by @PerformanceDBA in [Q: Database schema which can support specialized properties](http://stackoverflow.com/questions/4304217/database-schema-which-can-support-specialized-properties) – Mr. Deathless Jan 15 '16 at 10:09
  • 7
    "You could have a separate table for each product category" This is the exact solution you should use in this situation. I'm curious about why you think it's unappealing? – Gagege Dec 21 '16 at 12:49
  • Yes. Separate table for the attributes of each category seem fine. That's effectively what you achieve with an unstructured database, but without any guarantees that the child data is valid. There are benefits to NoSQL, but I'm not sure that unstructured data storage is one of them. – Ryan Sep 09 '18 at 23:25
6

The question seems to be based on two or three misconceptions. Unfortunately they are all too common ones among enthusiasts of faddish NoSQL products.

Firstly information (not "data") is never truly unstructured. Structure is the lens through which we view data in order to see information. Structure is the reason why data is useful.

Secondly, the commonly cited examples of such data (documents, images, mixed content) are extremely suitable candidates for storing in relational form.

Thirdly, SQL != Relational. The rationale for NoSQL products is that alternatives to SQL are needed. That is beyond doubt. Unfortunately NoSQL advocates tend to base their ideas on a misconception that the problems and limitations of SQL DBMSs are problems inherent in the relational model of data. That is not remotely true. A strong case could be made that the very best kind of NoSQL DBMS would be a relational one.

nvogel
  • 24,981
  • 1
  • 44
  • 82
  • Very thoughtful comment. New to unstructured databases. I agree that `information (not "data") is never truly unstructured"`. However, I am coming to realize that a lot of time, it is difficult to assign a structure to data, when it is generated. It may take a lot of time for that structure to evolve. If the only option is a structured database, then the only option is to either throw away the data or store it as flat files. – alpha_989 Aug 25 '18 at 21:09
  • On the other hand if we have unstructured databases, which can handle data that doesn't have any structure at the point of entry to the database, perhaps somebody can figure out in future what structure to assign to that data (and extract the `information` contained in the data), which would add value to the business which is collecting the data. Do you disagree? – alpha_989 Aug 25 '18 at 21:09
5

I don't think the question should be unstructured vs. unstructured data. It's more about performance for large quantities of data. I have some experience trying to make a SQL database into a non-structured data store. In my case, I had a bunch of dynamic (JSON) objects that needed to go into a table. I was using SQL because the objects were related to one another through parent-child relationships (i.e. a self-join). It worked fine for a test data set of about 5,000 objects.

Using SQL

HOWEVER, my production database contains about 3gb worth of data (about 1 million objects, give or take). I spent weeks building and optimizing my sql joins and queries. I was able to achieve a maximum performance of about 10ms to return a few nodes from a selected spot in the tree. Then, I ran into strange query performance issues that could only be solved by re-structuring the indices and/or dropping and re-creating the stored procedures. I was spending as much time maintaining the damn SQL database as I was coding the rest of my application. Not good. (Oh, and I should mention that I have about 3 years of hands-on DBA experience with SQL server, so I am by no means new to the game).

Using Couchbase

Fast-forward 18 months. I am now using Couchbase (a popular nosql database). I was able to get the identical functionality from CB by using views and map/reduce. I spent one week getting my CB deployment up and running. Latency on query lookups is sub-millisecond. The end user notices a dramatic increase in performance.

Bottom Line

If you have a lot of data, you are going to be hard-pressed to find a situation where SQL will come anywhere near the performance of a nosql database architecture, regardless of how structured or unstructured the data is.

theMayer
  • 15,456
  • 7
  • 58
  • 90
  • Thanks for sharing your experiences! Did distribute your database across several machines? My understanding is that MapReduce is largely inefficient on a single machine. – user3187713 Jan 14 '14 at 14:20
  • MapReduce/Hadoop is mostly useful for large amounts of data. If you can fit the data into a single machine.. perhaps some other architecture is more useful. – alpha_989 Aug 25 '18 at 21:10
  • Ah, architecture is a function of utility, not necessarily size. – theMayer Aug 26 '18 at 03:39