How to remove duplicate rows without losing the ability to do analytics on the count of those duplicate rows?

Question

I am working on an analytics application which contains data from crawled websites.

Right now the crawled data is stored in separate rows, causing a lot of duplication.

I will give an example to explain the situation:

The crawler goes to Google.com and Yahoo.com, and figures out the main keywords on each website.

For Google the keywords are "Search" and "Internet", and for Yahoo the keywords are "Search" and "News". (The accuracy of the keywords don't matter in this example).

These keywords are then inserted into the keywords table:

KEYWORDS
---------------------------
ID
Keyword
Created_at

insert into keywords (Keyword, Created_at) values ('Search', now)
insert into keywords (Keyword, Created_at) values ('Internet', now)
insert into keywords (Keyword, Created_at) values ('Search', now)
insert into keywords (Keyword, Created_at) values ('News', now)

So as you can see, the keywords for both Google and Yahoo were inserted into the database. But the problem is 'Search' is now in the database twice.

For analytics purposes, I need to keep a record that both Google and Yahoo had the keyword 'Search' (to keep this example simple I am excluding any foreign keys), so I can't simply not add 'Search' if it already exists in the database.

So this is my question:

I would like to only store the keyword once (the first time it is found), and I also need a corresponding record somewhere which keeps track of every time it is found.

So it's almost like I need something like this:

KEYWORDS                          KEYWORDS_FREQUENCY                    
---------------------------       ---------------------------
ID                                ID
Keyword (unique)                  Keyword_ID
                                  Created_at

But I'm not sure if I'm confused and thinking about this in a wrong way. I'm also concerned analytics queries on this will become very slow as there needs to be joins, whereas the simple redundant design (the original design described above) would probably be quite fast as it's one big dumb table.

If you can understand what I am trying to achieve, could you please give me advice on how to design this?

Thank you.

Do you want to also store the origin of the keyword or just how many times it occurred during crawling? — S.Karras, Oct 20 '16 at 06:21
Thanks for your message. I need to store the origin of the keyword also. — Tom Brock, Oct 20 '16 at 06:30

score 0 · Answer 1 · answered Oct 20 '16 at 23:53

0

Add the origin as a column eg:

KEYWORDS
---------------------------
ID
Keyword
Created_at
Origin

insert into keywords (Keyword, Created_at, Origin) values ('Search', now, 'Google')
insert into keywords (Keyword, Created_at, Origin) values ('Internet', now, 'Google')
insert into keywords (Keyword, Created_at, Origin) values ('Search', now, 'Yahoo')
insert into keywords (Keyword, Created_at, Origin) values ('News', now, 'Yahoo')

answered Oct 20 '16 at 23:53

Taryn East

27,486
9
86
108

Doesn't this just make things worse? There is even more duplication now. – Tom Brock Oct 21 '16 at 03:11
It depends what you mean by duplication. Now you can pull out all the unique keywords (using "SELECT DISTINCT keyword FROM keywords") or you can pull out all the keywords for a given service (using "SELECT keyword FROM keywords WHERE Origin = 'Google'") or count them that way too. If you need something different to this - then please update your question and explain what you mean by duplication and why you need this? Especially given that in your comments, you talked about needing the origin stored... and that's exactly what I've suggested ;) – Taryn East Oct 21 '16 at 03:16
Hi Taryn, I do explain it in my question. "So as you can see, the keywords for both Google and Yahoo were inserted into the database. But the problem is 'Search' is now in the database twice." I also state this: "I would like to only store the keyword once (the first time it is found), and I also need a corresponding record somewhere which keeps track of every time it is found." I think maybe you were answering @S.Karras's question. – Tom Brock Oct 21 '16 at 03:21
Yes... I did read that - and I addressed the problem you're actually having, rather than by implementing the solution you're suggesting. You haven't explained *why* you want only one instance of the keyword in this particular table... and I'm not sure why my solution won't solve both your problems without requiring you proposed other-table. This is a simpler solution than that that actually solves the issue (needing unique set of keywords, also recording the origin). If you need this other table for another purpose... then give a *concrete* example of it, so we can see why. – Taryn East Oct 21 '16 at 03:38
Note that just because it's clear in your head what you need... doesn't mean that you've explained it clearly enough for it to be the same inside our heads... the fact that my solution apparently isn't a solution for you - means you haven't explained clearly enough what you need :) otherwise I wouldn't have proposed it. – Taryn East Oct 21 '16 at 03:39
No, the problem I'm having is this: "the problem is 'Search' is now in the database twice". And I need this solution: "I would like to only store the keyword once (the first time it is found), and I also need a corresponding record somewhere which keeps track of every time it is found." I'm not going to argue with you. – Tom Brock Oct 21 '16 at 03:40
1

Yes but *why* is it a problem that search is in the db twice? – Taryn East Oct 21 '16 at 03:41
1

We have many queries which lookup keywords to use. For example, we will shortly be implementing a tagging system that is based upon these keywords. Of course we can do some sort of .distinct('Keyword') query on the Keyword table, but it has a few hundred million rows. Therefore I think it makes sense to separate the actual keyword data ("Search", etc.) from the analytics data (created_at, etc.) – Tom Brock Oct 21 '16 at 03:48
1

ok, that makes a lot of sense. So yes add a uniqueness constraint on the keywords table, and add a table that contains "instances of keyword being used" with columns "keyword_id (references keywords.id), origin, datetime" or similar. – Taryn East Oct 21 '16 at 04:05

score 0 · Answer 2 · answered Oct 21 '16 at 05:27

TL;DR You have not justified introducing string ids, but avoiding duplicates is always a good start.

It is not a problem per se to have "redundancy" in the sense of the same value appearing multiple times in a database. In your own proposal, introduced id values appear in exactly the places that their correponding string values used to appeared, so having also added a mapping table from id to string you have actually added inessential data to the database, which is clearly "redundant" in another sense. (Which includes consequent "redundant" joins.) Meaningful redundancy arises when multiple rows overlap in what they state about the application situation, and problematic redundancy of that kind arises when it is not controlled by organizing according to good design principles.

Your concern with ids seem to be trying to address anticipated performance problems that you are solving via data compression. For your particular DBMS and patterns of update & querying, one design might have some kind of implementation performance problem that might be improved (at the expense of complexity, maintainability and merely different tradeoffs of time and space) by another. But assuming it is unjustified.

You probably either want a unique identifier for each occurrence of a value (typically using auto-increment id DBMS facilities), or a count of its occurrences. The benefit of this vs allowing duplicates is that your tables are relations, because DBMS implementation of querying data in terms of generic properties expressed by logical conditions is based on tables being relations. (Essentially, to query using logical conditions you have to convert a representation with duplicates to one without, then query, then if wanted convert back to a representation with duplicates.) On the other hand performance, again, may lead to your storing lists of values rather than relations, with inserts being faster at the expense of queries being slower.

Always start with the most straightforward design. A relational table holds the rows that make a true statement (proposition) from that tables fill-in-the-blanks sentence template (predicate). Choose sufficient tables/predicates to be able to describe any application situation that can arise. Optimize when a problem is demonstrated, and demonstrate that your "optimization" is "better".

How to remove duplicate rows without losing the ability to do analytics on the count of those duplicate rows?

2 Answers2

Linked