2

I've been putting together a little SQLite database to track the top 100 songs from the iTunes RSS feed. I've built the script in Bash to do all the hard work and it's working finally, but I'm not sure if my database structure is correct, so I'm looking for some feedback on the best way to go as I am only learning SQL as I go at the moment so I don't want to dig myself into a hole when it comes to building the queries to retrieve data in time!

I have 3 tables like so;

artists_table

artist_id - PK
artist_name

songs_table

song_id - PK
artist_id - FK (from the artists table)

charts_table

chart_id - PK
song_id - FK (from the songs table)
position - (chart position 1-100)
date - (date of chart position xxxx-xx-xx)

The artists and songs table seem good to me, got foreign key constraint working...etc but I'm not sure about the charts table, anything obviously wrong with this structure?

I want to track songs/artists/positions over time so I can generate some stats...etc

Thanks,

DreadPirateShawn
  • 8,164
  • 4
  • 49
  • 71
DerekOS
  • 57
  • 1
  • 9
  • 1
    What is the purpose ? (a) Are you tracking, in order, artists; artist_songs; artist_song_chart_positions, for a chosen group of artists ? Or are you tracking charts, on a (eg) daily basis, and recording the position of artist_songs ? – PerformanceDBA Apr 29 '15 at 01:07
  • 1
    I think I'll be tracking charts on a daily basis and recording the position of artist_songs. I'll be using the info to automatically "tag" songs in a music library as hot, descending, ascending, new and recurrent maybe once a week. So as songs rise at a certain rate over time they will be tagged ascending, if they are newly entered into the charts then tagged as new..etc. I would like to be able to do some calculations on their positions over time, if that makes sense. – DerekOS Apr 29 '15 at 10:30
  • 1
    It makes sense. (a) So the main (only ?) input is chart info. Ie. You don't have a library of Artists and Songs, those items get inserted as a result of a either a song or artist being placed on a chart, correct ? The artist & song library starts at zero and gets built up over time. (b) how do you ensure the artists, songs, charts are unique in those tables ??? – PerformanceDBA Apr 29 '15 at 11:31
  • Yes exactly. (a) yes. (b) I have built the check into the bash script itself. It pulls the RSS feed, formats it the way I want it to a text file and then goes ahead inserting the artist, song and position into the db. It checks if the artist already exists, if it does then retrieves it's unique id, then checks if the song exists by that artist, if it does, retrieves it's id, checks if chart position by artist and song for today's date exists, if it doesn't then records it in the charts table. It won't insert duplicate artists or songs by the same artist, took a while, but that part is solid. – DerekOS Apr 29 '15 at 11:37
  • 1
    (b.1) How exactly does it *check if the artist[song] already exists* ? (b.2) How do you know that there is NOT more than occ of a specific artist/song on file ? – PerformanceDBA Apr 29 '15 at 11:40
  • It runs in the order of artist, song, position. So it first checks if the artist exists, if yes, i store the artist id in a variable, then checks if the song name exists in the song table with the artist id from the stored variable, if no, then it creates a new song record and inserts the artist id as a foreign key in the songs table. – DerekOS Apr 29 '15 at 11:45
  • For example, Taylor swift has 1 artist record but 3 song records associated with her artist id currently. The 3 songs have the PK from Taylor swift as its foreign key. – DerekOS Apr 29 '15 at 11:47
  • Please review the *tentative* Answer. – PerformanceDBA Apr 29 '15 at 12:28

2 Answers2

4

Initial Response

I ask you about the data, in order to answer your Question, but you keep telling me about the process. No doubt, that is very important to you. And now you wish to ensure that the Record Filing System is correct.

Personally, I never write a line of code until I have the database designed. Partly because I hate to rewrite code (and I love to code). You have the sequence reversed, an unfortunate trend these days. Which means, whatever I give you, you will have to rewrite large chunks of your code.

(b.1) How exactly does it check if the artist[song] already exists ?

(b.2) How do you know that there is NOT more than occ of a specific artist/song on file ?

Right now, given the details in your Question, let's say that you have incoming, that Pussycat Dolls place 66 on the MTV chart today:

    INSERT artist VALUES ( "Pussycat Dolls" )    -- succeeds, intended
    INSERT artist VALUES ( "Pussycat Dolls" )    -- succeeds, unintended
    INSERT artist VALUES ( "Pussycat Dolls" )    -- succeeds, unintended
  1. Exactly which Pussycat Dolls record placed 66th today ? When you RFS grows, and you have more fields in artist, eg. birth_date, which of the three records would you like to update ?

  2. Ditto for Song.

  3. How is Chart identified, is it something like US Top 40 ?

(b.1) How exactly does it check if the artist[song] already exists ?

When you execute code, it runs inside the sqLite program. What is the exact SQL string that you pass it ? Let's say you do this:

    SELECT $artist_id = artist_id
            FROM artist
            WHERE artist_name = $artist_name
    IF $artist_id = NULL
        INSERT artist VALUES ( $artist_name )

Then you have going to have a few surprises when the system goes "live". Hopefully this interaction will eliminate them. Right now you have a few hundred artists.

  • when you have a few thousand artists, the system will slow down to a snails pace.

  • when things go awry, you will have duplicate artists, songs, charts.

Record Filing System

Right now, you have a pre-1970's ISAM Record Filing System, with no Relational integrity, power, or speed.

If you would like to understand more about the dangers of an RFS, in todays Relational context, please read this Answer.

Relational Database

As I understand it, you want the integrity, power, and speed of a Relational Database. Here is what you are heading towards. Obviously, it is incomplete, unconfirmed, there are may details missing, many questions remain open. But we have to model the data, only as data (as opposed to what you are going to do with it, the process), and nothing but the data.

This approach will ensure many things:

  • as the data grows and is added to (in terms of structure, not population), the existing data and code will not change

  • you will have data and referential integrity

  • you can obtain each of your stats via a single SELECT command.

  • you can execute any SELECT against the data, even SELECTs that you are not capable of dreaming about, meaning unlimited stats. As long as the data is stored in Relational form.

A database is a collection of facts about the real world, limited to the subject area of concern. Thus far we don't have facts, we have a recording of an incoming RSS stream. And the recording has no integrity, there is nothing that your code can rely on. This is heading in the direction of facts:

First Draft Music Chart TRD (Obsolete due to progression, see below.)

Response to Comments 1

Currently, I am only tracking one chart, but I see in your model that it also has the ability to track several, that is nice!

Not really. It is a side-effect of Doing Things Properly. The issue here is one of Identification. A Chart Position is not identified by RSS Feed ID, or chart_table.id, plus a PositionNo plus a DateTime. No. A Chart Position is identified as US Top 100/27 Apr 15/1… The side effect is that ChartName is part of the Identifier, and that allows multiple Charts, with no additional coding.

In these dark days of IT, people often write systems for one Country, and implement a StateCode all over the place. And then experience massive problems when they open up to an international customer base. The point is, there is no such thing as a State that does not have a Country, a State exists only in the context of a Country. So the Identifier for State must include a Country Identifier, it is (CountryCode, StateCode). Both Australia and Canada have NT for a StateCode.

If I can explain how I store the data from the rss feed, it might clear things up somewhat.

No, please. This is about the data, and only the data. Please review my previous comments on that issue, and the benefits.

I am away from my main computer at the moment, but I will respond within the next couple of hours if thats ok.

No worries. I will get to it tomorrow.

Your model does make sense to me though,

That is because you know the data values intimately, but you do not understand the data, and when someone lays it out for you, correctly, you experience pleasurable little twitches of recognition.

I don't mind having to recode everything, its a learning curve!

That's because you put the cart before the horse, and coded against data laid out in a spreadsheet, instead of designing the database first and coding against that second.

If you are not used to the Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.

Response to Comments 2

Just one more quick question.

Fire away, until you are completely satisfied.

In the diagram, would there be any disadvantage to putting the artist table above the song table and making the song table a child of the parent artist instead? As artists can have many songs, but each song can only have 1 artist. Is there any need for the additional table to contain just the artistPK and songPK. Could I not store the artistPK into the songs table as a FK, as a song can only exist if there is an associated artist?

  1. Notice your attachment to the way you had it organised. I repeat:

A database is a collection of facts about the real world, limited to the subject area of concern.

Facts are logical, not physical. When those facts are organised correctly (Normalised, designed):

You can execute any SELECT against the data, even SELECTs that you are not capable of dreaming about, meaning unlimited stats. As long as the data is stored in Relational form.

When they aren't, you cant. All SQL (not only reports that are envisioned) against the data is limited to the limitations in the model, which boils down to one thing: discrete facts being recorded in logical form, or not.

With the TRD we have progressed to recording facts about the real world, limited only by the scope of the app, and not by the non-discretion of facts.

Could I not store the artistPK into the songs table as a FK, as a song can only exist if there is an associated artist?

In your working context, at this moment, that is true. But that is not true in the real world that you are recording. If the app or your scope changes, you will have to change great slabs of the db and the app. If you record the facts correctly, as they exist, not as limited to your current app scope, no such change will be necessary when the the app or your scope changes (sure, you will have to add objects and code, but not modify existing objects and code).

In the real world, Song and Artist are discrete facts, each can exist independent of the other. Your proposition is false.

  • Ave Maria existed for 16 centuries before Karen Carpenter recorded it.

  • And you already understand and accept that an Artist exists without a `Song.

Is there any need for the additional table to contain just the artistPK and songPK.

It isn't an "additional table to contain just the artistPK and songPK", it is recording a discrete fact (separate to the independent existence of Artist and Song), that a specific Artist recorded a specific Song. That is the fact that you will count on in theChartDatePosition`

Your proposition places Song as dependent on, subordinate to, Artist, and that is simply not true. Any and all stats (dreamed of or not) that are based on Song will have to navigate Artist::ArtistSong, then sort or ORDER BY, etc.

artists can have many songs, but each song can only have 1 artist.

That is half-true (true in your current working context, but not true in the real world). The truth is:

  • Each Artist is independent
    Each Song is independent
    Each Artist recorded 1-to-n Songs (via ArtistSong)
    Each Song was recorded by 1-to-n Artists (via ArtistSong)

For understanding, changing your words above to form correct propositions (as opposed to stating technically correct Predicates):

  • Artists can have many RecordedSongs
    Each RecordedSong can only have 1 Artist Each RecordedSong can only have 1 Song

So yes, there are disadvantages, significant ones.

Which is why I state, you must divorce yourself from the app, the usage, and model the data, as data, and nothing but data.

Solution 2

I have updated the TRD.

Second Draft Music Chart TRD

  • Courier means example data; blue indicates a Key (Primary is always first); pipe indicates column separation; slash indicates Alternate Key (only the columns that are not in the PK are shown); green indicates non-key.

  • I am now giving you the Predicates. These are very important, for many reasons. The main reason here, is that it disambiguate the issues we are discussing.

    • If you would like more information on Predicates, visit this Answer, scroll down (way down!) to Predicate, and read that section. Also evaluate that TRD and those Predicates against it.
  • The index on ChartDateSong needs explanation. At first I assumed:

       PK ( Chart, Date, Rank )
    

    But then for Integrity purposes, as well as search, we need:

       AK ( Chart, Date, ArtistId, SongId )
    

    Which is a much better PK. So I switched them. We do need both. (I don't know about NONsqLite, if it has clustered indices, the AK, not the PK should be clustered.)

       PK ( Chart, Date, ArtistId, SongId ) 
       AK ( Chart, Date, Rank )
    

Response to Comments 3

What about the scenario when a song enters the charts with the same song_name as a record in the song_table but is completely unrelated (not a cover, completely original, but just happens to share the same name)

In civilised countries that is called fraud, obtaining benefit by deception, but I will try to think in devilish terms for a moment and answer the question.

Well, if it happens, then you have to cater for it. How does the feed inform you of such an event ? I trust it doesn't. So then your Song Identifier is still the Name.

and instead of a unique song record being created, the existing song_id is added to the artistssongs_table with the artist id, wouldn't this be a problem?

We don't know any better, so it is not a problem. No one watching that feed knows any better either. If and when you receive data informing you of that issue, through whatever channel, and you can specify it, you can change it.

Normally we have an app that allows us to navigate the hierarchies, and to change them, eg. A ReferenceMaintenance app, with an Exporer-type window on the left, and combo dialogues (list of occs on top, plus detail of one occ on the bottom) on the right .

Until then, it is not a form of corruption, because the constraint that prevents such corruption is undefined. You can't be held guilty of breaking a law that hasn't been written yet. Except in rogue states.

Although a song can have the same name, it doesn't necessarily mean it's the same record.

Yes.

Wouldn't it be better to differentiate a song by the artist?

They are differentiated by Artist.

You do appreciate that the fact of a Song, and the fact of an Artist playing a song, are two discrete facts, yes ? Please question any Predicates that do not mean perfect sense, those are the propositions that the database supports.

  • Ave Maria exists as an independent fact, in Song

  • Karen Carpenter, Celine Dion, and Yours Truly exist as three independent facts, in Artist

  • Karen Carpenter-Ave Maria, Celine Dion-Ave Maria, and Yours Truly-Ave Maria exist as three discrete facts in ArtistSong.

  • That is seven separate facts, about one Song, about three Artists.

Response to Comments 4

I do understand it now. The artistsong_table is where the 2 items "meet" and a relationship actually exists and is unique.

Yes. I just wouldn't state it in that way. The term Fact has a technically precise meaning, over and above the English meaning.

A database is a collection of facts about the real world, limited to the subject area of concern.

Perhaps read my Response 3 again, with that understanding of Fact in mind.

  • Each ArtistSong row is a Fact. That depends on the Fact of an Artist, and the Fact of a Song. It establishes the Fact that that Artist recorded that Song. And that ArtistSong Fact is one that other Facts, lower in the hierarchy, will depend upon.

  • "Relationship ... actually". I think you mean "instance". The relationship exists between the tables, because I drew a line, and you will implement a Foreign Key Constraint. Perhaps think of Fact as an "instance".

Just to make sure I understand the idea correctly, if I were to add "Genre" into the mix, would I be correct in thinking that a new 'independent' table genre_table would be created and the artistsong_table would inherit its PK as an FK?

Yes. It is a classic Reference or Lookup table, the Relationship will be Non-identifying. I don't know enough about the music brothelry to make any declarations, but as I understand it, Genre applies to a Song; an Artist; and an ArtistSong (they can play a Song in a Genre that is different to the Song.Genre). You have given me one, so I will model that.

The consequence of that is, when you are inserting rows in ArtistSong, you will have to have the Genre. If that is in the feed, well and good, if not, you have a processing issue to deal with. The simple method to overcome that is, implement a Genre "", which indicates to you that you need to determine it from other channels.

It is easy enough to add a classifier (eg. Genre) later, because it is a Non-identifying Relationship. But Identifying items are difficult to add later, because they force the Keys to change. Refer para 3 under my Response 1.

You are probably ready for a Data Model:

Third Draft Music Chart Data Model

Community
  • 1
  • 1
PerformanceDBA
  • 32,198
  • 10
  • 64
  • 90
  • 2
    I really appreciate you taking the time to recommend a solution on this, you're obviously quite up to speed in relation to this! DBs are new to me obviously enough! – DerekOS Apr 29 '15 at 15:07
  • Currently, I am only tracking one chart, but I see in your model that it also has the ability to track several, that is nice! If I can explain how I store the data from the rss feed, it might clear things up somewhat. I am away from my main computer at the moment, but I will respond within the next couple of hours if thats ok. Your model does make sense to me though, I don't mind having to recode everything, its a learning curve! – DerekOS Apr 29 '15 at 15:30
  • @DerekOS. My pleasure. Other items answered in the post. – PerformanceDBA Apr 29 '15 at 16:25
  • 2
    Thanks again. I think this has been well and truly answered, now its time for me to do some more reading. I really like "you can execute any SELECT against the data, even SELECTs that you are not capable of dreaming about, meaning unlimited stats. As long as the data is stored in Relational form" as its probably the main reason I asked this question and I knew if the model wasn't correct from the start, I would have to start again from scratch. I do appreciate your help once again! – DerekOS Apr 29 '15 at 19:06
  • Just one more quick question. In the diagram, would there be any disadvantage to putting the artist table above the song table and making the song table a child of the parent artist instead? As artists can have many songs, but each song can only have 1 artist. Is there any need for the additional table to contain just the artistPK and songPK. Could I not store the artistPK into the songs table as a FK, as a song can only exist if there is an associated artist? – DerekOS Apr 29 '15 at 19:46
  • @DerekOS. You are most welcome. *Undreamt-of SELECTs*: Yes. *New questions*: answered in the post. Note, it has been re-sequenced, it is now in chronological order. – PerformanceDBA Apr 30 '15 at 02:53
  • What about the scenario when a song enters the charts with the same song_name as a record in the song_table but is completely unrelated (not a cover, completely original, but just happens to share the same name) and instead of a unique song record being created, the existing song_id is added to the artistssongs_table with the artist id, wouldn't this be a problem? Although a song can have the same name, it doesn't necessarily mean it's the same record. Wouldn't it be better to differentiate a song by the artist? Whereas artists as a rule don't share the same name in the charts. – DerekOS Apr 30 '15 at 10:56
  • @DerekOS. I have responded in the Answer. – PerformanceDBA Apr 30 '15 at 13:55
  • 2
    I do understand it now. The artistsong_table is where the 2 items "meet" and a relationship actually exists and is unique. As you said earlier it doesn't allow for the duplication of song_name in the songs_table which is the point of a relational database, why have 2 records in the song_table with the same name? doesn't make sense. Just to make sure I understand the idea correctly, if I were to add "Genre" into the mix, would I be correct in thinking that a new 'independent' table genre_table would be created and the artistsong_table would inherit its PK as an FK? – DerekOS Apr 30 '15 at 18:34
  • @DerekOS. *Genre* Yes, that would be a classic Lookup or Reference table. Raises two issues (a) whether the child is Artist, or Song, or both (each means a different thing) (b) whether each of those relationships is Identifying or Non-identifying. Please study the [**IDEF1X Notation**](http://www.softwaregems.com.au/Documents/Documentary%20Examples/IDEF1X%20Introduction.pdf) doc carefully. The former changes the data hierarchy, the latter doesn't. – PerformanceDBA May 01 '15 at 02:40
  • @DerekOS. Better to identify and add all those items now. The point of modelling is to get the Relational Keys right, because that reflects the real world, and thus the structure is stable, it will not change with every enhancement. – PerformanceDBA May 01 '15 at 02:43
  • @DerekOS. I have responded in the Answer. Data Model posted. – PerformanceDBA May 01 '15 at 10:35
  • 2
    Ah, I see how its non-identifying now as its key will not form part of the child's PK, so it will be unaffected. Wow, you have given me a lot to think about and do, I do appreciate it very much. Now that I know my data is modelled correctly, I will work on entering it correctly, then the hard part will be figuring out what feedback I want from it. Thank you again, you have been more than helpful! I think its time to mark this answered! – DerekOS May 01 '15 at 16:49
  • 1
    @DerekOS. You are most welcome. Keep studying the IDEF1X Intro, people tell me that they get more out of it on the second reading, eg. during implementation. You probably won't need this [**IDEF1X Anatomy**](http://www.softwaregems.com.au/Documents/Documentary%20Examples/IDEF1X%20Model%20Anatomy.pdf) that I just did for someone else. Yes, please, mark it answered *and* votes. – PerformanceDBA May 03 '15 at 04:18
1

It all depends on the relationships (one-to-one, one-to-many, many-to-many) your data is going to have.

The way you implemented your charts table indicates that:

  • Each chart has only/belongs to one song
  • A song can have many charts

It is a one-to-many relationship. And if that was what you intended then everything seems fine.

However:

  1. If your charts can have many songs and a song will have only one chart (also a one-to-many relationship but reversed), the song_id column needs to be taken out from the charts table and the songs table needs chart_id column in.
  2. If your charts can have many songs and your songs can have many charts as well (many-to-many relationship), then you need a "joint table" which could be something like this: TABLE: charts_songs, COLUMNS: id, chart_id, song_id, position
  • I'm becoming even more lost now! I see what you mean by the relationships, I'm just not sure where this fits in either example now. Basically I pull the chart once every day, so I will have artists, songs and chart positions. I already have the coding in place not to duplicate artists or songs, only if they're unique, so the artists and songs table will grow over time. Each day I pull the top 100, I need to record this somehow, so that lets say in 1 months time I can see which artist had song/s in the top 10 over 28 days for example, maybe a table for each day? Lost... – DerekOS Apr 28 '15 at 22:04