Understanding Normalization & Duplicates - I Guess I Don't - Adding Artist & Title Ids

Question

I began with a table listing the top 100 songs by date for the years 1958 through 1980. For each date, there are 100 records. Obviously many will be duplicates as a song changes position from week to week. Also, the artists will be duplicated (think Elvis) numerous times. There are ~ 116,000 records in the table.

This table had the following fields

uniq,
date,
artist,
title,
position

To eliminate duplicates (normalization as I understand it) I have modified the table so that it now looks like this

uniq,
date,
artistcode,
titlecode,
position

And have two new tables artists and titles. Artists looks like this

artist,
artistcode

And titles looks like this

title,
titlecode

To get started in the right direction, I simply want to reassemble (join) these tables so that I have a view that looks like the original table, ie

uniq,
date,
artist,
title,
position

and has those 116000 records. After reading a couple of books and working with several tutorials, I have come to the conclusion that I have a misconception of what normalization should do, or I am simply headed in the wrong direction.

The SQL syntax to create the view would be much appreciated.

The view is easy to answer. Please clearly say as much as you can about what you mean by "duplicates" and what problem you see with them and what you think "normalization" means. (Seems "eliminate duplicates" means replace "duplicates" by associated ids. Seems the problem might be "needs [sic] data compression".) — philipxy, Jun 15 '17 at 00:37
Unfortunately this is two entirely different questions. (Your title & your last line--even if they share most of the rest.) (It would help if you posted them separately.) Glad you got your view (& you could accept amflare's answer.) Did you get satisfaction re "duplicates" & "normalization"? — philipxy, Jun 18 '17 at 21:34

score 1 · Answer 1 · answered Jun 13 '17 at 20:53

1

To get back to the original output with the multiple tables, you can use the following syntax with JOINs

SELECT s.uniq, s.date, a.artist, t.title, s.position
FROM songs AS s
JOIN artists AS a ON a.artistcode = s.artistcode
JOIN titles AS t ON t.titlecode = s.titlecode

If you are trying to eliminate duplicate song entries, you can add this to the query:

GROUP BY t.title

answered Jun 13 '17 at 20:53

amflare

4,020
3
25
44

Adding that GROUP is invalid because the selected columns are not single-valued per title. – philipxy Jun 14 '17 at 08:11
I'm not sure what you mean by "single-value", but using `s.titlecode` instead will probably fix it. – amflare Jun 14 '17 at 13:35
Mysql requires all items in a SELECT statement to also appear in the GROUP BY statement unless sql_mode='' (or some other group by mode). When I think through it, it does make sense to include all fields in the GROUP BY clause – Dave Davis Jun 14 '17 at 18:30
This answer solved my problem. I was close, but ddn't have it quite right. Thanks. – Dave Davis Jun 14 '17 at 18:32
@DaveDavis & amflare There can be duplicate song, artist, song/artist, date, date/position etc etc subrows. But it's not clear what "eliminate duplicate song entries" or "include all fields" are supposed to mean. If you don't drop and/or add some columns, (legal SQL) GROUP BY has no effect. – philipxy Jun 15 '17 at 00:47
@dave-davis MySQL with default settings does NOT require all non-aggregating columns of the select clause to be in the group by clause. Whilst we may wish this was not true, regrettably it is. – Paul Maxwell Jun 18 '17 at 22:08

philipxy · Answer 2 · 2017-06-18T21:59:01.953

What "duplicates"? There is nothing wrong per se with the same value appearing multiple times. You need to begin reading some academic textbook(s)/slides/course(s) about information modeling and relational databases.

Each row that is in or not in a table makes a statement about the situation. The sort of "duplicate" and "redundancy" problems that normalization addresses arise sometimes when multiple rows of the same table say the same thing about the situation. (Which might or might not involve subrow values appearing multiple times.)

Eg: If you had a table like this one but with an additional column and a given artist/title combination always appeared with the same value in that column (like if an artist never has multiple recordings with the same title charting and you added the playing time of each recording) then there would be a problem. ("... AND recording artist/title is time minutes long") If you had a table like this one but with an additional column and a value in it always appeared with the same artist/title combination (like if you added a recording id) then there would be a problem. ("... AND recording recordingcode is of title title by artist artist") Right now there is no problem. What do you expect as an answer? The answer is, normalization says there's no problem, and your impressions are not informed by normalization.

Normalization does not involve replacing values by ids. Introduced id values have exactly the same pattern of appearances as the values they identify/replaced, so that doesn't "eliminate duplicates", and it adds more "duplicates" of the ids in new tables. The original table as a view is a projection of a join of the new tables on equality of ids. (You might want to have ids for ease of update or data compression (etc) at the expense of more tables & joins (etc). That's a separate issue.)

-- hit `uniq` is title `title` by artist `artist` at position `position` on date `date`
/* FORSOME h.*, a.*, t.*,
    hit h.uniq is title with id h.titlecode by artist with id h.artistcode
        at position h.position on date h.date
AND artist a.artist has id a.artistcode AND h.artistcode = a.artistcode
AND title t.title has id t.titlecode AND h.titlecode = a.title
AND `uniq` = h.uniq AND `title` = t.title AND `artist` = a.artist
    AND `position` = h.position AND `date` = h.date
*/
/* FORSOME h.*, a.*, t.*,
    Hit(h.uniq, h.titlecode, h.artistcode, h.position, h.date)
AND Artist(a.artist, a.artistcode) AND h.artistcode = a.artistcode
AND Title(t.title, t.titlecode) AND h.titlecode = a.title
AND `uniq` = h.uniq AND `title` = t.title AND `artist` = a.artist
AND `position` = h.position AND `date` = h.date
*/
create view HitOriginal as
select h.uniq, h.date, a.artist, t.title, h.position
from Hit h
join Artist a on h.artistcode = a.artistcode
join Title t on h.titlecode = t.titlecode

Not sure I follow. If I have a list of top100 songs by chart date, the same song/artist combination is going to appear multiple times as a song moves up or down thechart. You are correct that an id simply replacing a song doesn't do much but doesn't it make it easier to, for instance, update the spelling of the artist? Suppose I misspelled Beetles. By using an id for the artist and relating it to a series of songs, I can fix the spelling of Beatles by updating one record. — Dave Davis, Jun 14 '17 at 18:15
Yes "the same song/artist combination is going to appear multiple times". *So what?* (See my answer & links.) For a more specific answer please edit your question to clarify "duplicate" precisely and if you think you see a problem be specific. Pin down impressions. To know when and why there are design problems you need a textbook, because the explanation is *complex*. Eg read & apply a normalization chapter. PS Yes you might want to have ids for the ease of update (etc) at the expense of more tables & joins (etc). But you don't need to. And it has nothing to do with normalization. — philipxy, Jun 14 '17 at 22:23

Understanding Normalization & Duplicates - I Guess I Don't - Adding Artist & Title Ids

2 Answers2

Linked