Bear with me as I am still learning. Essentially, in abstract terms, I have a set of data which could easily fit in a 1NF or 2NF, but also have some that varies in number of items, which I want associated with a record, in which the order must be maintained. be aware, I'm not concerned with any specific database or language, just the very basic approach and theory to this problem.
To simplify to most basic elements, I have an ID, a Goal, and Tasks required in order to complete the goal. For this example, I have excluded other fields like Name (string), Section (string), and so on, as they are straightforward to handle.
At first, I figured, maybe there will only ever be 5 Tasks, as a casual glance of the data set seemed to indicate about 2-3 Tasks (strings). The order in my code was implied to be 1 -> 2 -> 3 and so on.
ID (key), Goal (string), Task1, Task2, Task3, Task4, Task5
I immediately did not like that, as half the values end up being NULL, but it sort of worked, and I was learning some other things like how to call the SQL from my scripting language. Then I started seeing Goals that had 6, 7 and 8 Tasks. :( Do I just keep randomly adding more columns as needed, and thus increase the percentage of NULLs stored? No. Not a good idea.
So I wondered, do I just cram all tasks into one field, and specify a delimiter? Then I could just use a split and join or a regexp to format the data. In this example, my tasks consist of 1-3 tokens of [A-Za-z ']
, so it's easy enough to handle.
ID (key), Goal (string), Tasks (string)
Where Tasks
is of the form task1,task2,task3,...
Something about that just seems to bother me. What if I am working on multiple goals at the same time, and want to get a list of all Names that need the same set of tasks applied? For example, say I have:
123, "Name1", "Goal1", "task1,task2,task3,task4,task5"
456, "Name2", "Goal2", "task2,task3,task4"
789, "Name3", "Goal3", "task3,task4,task5"
How messy it becomes now to look up all records that require task3
? Maybe I could just use a LIKE to find what I want? Seems like a horrible abuse of the function. Could break it all apart, handle the logic in a script, seems even more messy, inefficient, difficult to maintain. For example, making changes to all task3
entries, or changing the order of tasks, would not be good.
Shooting fish in a barrel and using knives on a cutting board could both arguably be used to make sushi...
So I wondered about putting the Tasks data in a separate table, sharing the same ID key. That's look like this.
Main Table
:
123, "Name1", "Goal1"
456, "Name2", "Goal2"
789, "Name3", "Goal3"
Tasks Table
:
123, "Task1"
123, "Task2"
123, "Task3"
123, "Task4"
123, "Task5"
456, "Task2"
456, "Task3"
456, "Task4"
789, "Task3"
789, "Task4"
789, "Task5"
At this point my gut feeling is that something has gone horrifically wrong with my thinking. I've lost the ability to ensure that the order is maintained. A query for all tasks needed by any specific ID could result in any order. It's also storing a lot of redundant data. At least I got rid of the NULLs? But that's no good.
At this point, something else is bugging me, which probably should have been addressed earlier in design. But I am trying to teach myself, and learning as I go. So here I go, off on a tangent.
There's a lot of redundant textual data, as these Task descriptions are constant. So I was wondering how to best optimize that, to minimize disk usage, and increase speed, without cluttering code with too much scripting overhead. One idea I had was to create a table of enumerations.
Enumerations: ID (key), Task (string)
1, Task5
2, Task4
3, Task3
4, Task2
5, Task1
6, Task10
7, Task9
8, Task8
9, Task7
10, Task6
and so on.
Well, at least instead of a string being stored everywhere, I could store a much smaller integer. Even if they were in the worst case 64-bit integers, that's 8 bytes, still smaller than the strings I'd be storing. My code would read in the enumerations, store in run time, and use that to reference the strings.
Not sure if that's a valid technique, if there's a better way to approach that problem, or even what it is called. Indexing? Or is that something different? Or is it something some databases can do automagically?
Anyways, back to the main problem, what to do with my arbitrary list of order-dependent tasks? Create 1-off tables per main record, each with it's own ORDER (key) and Task (string/int/enum) entry? Seems even worse for overhead.
It seems to me like this is a basic problem, and has a few standard ways of approaching it. On my limited budget, lack of books, slow connection, and Google endlessly sending me nowhere, I figured I'd ask for any tips. Any free online references to sources of knowledge (specific sites or articles) also welcome.