I am working on an analytics application which contains data from crawled websites.
Right now the crawled data is stored in separate rows, causing a lot of duplication.
I will give an example to explain the situation:
The crawler goes to Google.com and Yahoo.com, and figures out the main keywords on each website.
For Google the keywords are "Search" and "Internet", and for Yahoo the keywords are "Search" and "News". (The accuracy of the keywords doesn't matter in this example.)
These keywords are then inserted into the keywords table:
KEYWORDS
---------------------------
ID
Keyword
Created_at
insert into keywords (Keyword, Created_at) values ('Search', now());
insert into keywords (Keyword, Created_at) values ('Internet', now());
insert into keywords (Keyword, Created_at) values ('Search', now());
insert into keywords (Keyword, Created_at) values ('News', now());
So as you can see, the keywords for both Google and Yahoo were inserted into the database. But the problem is 'Search' is now in the database twice.
For analytics purposes, I need to keep a record that both Google and Yahoo had the keyword 'Search' (to keep this example simple I am excluding any foreign keys), so I can't simply skip inserting 'Search' when it already exists in the database.
So this is my question:
I would like to only store the keyword once (the first time it is found), and I also need a corresponding record somewhere which keeps track of every time it is found.
So it's almost like I need something like this:
KEYWORDS                        KEYWORDS_FREQUENCY
---------------------------     ---------------------------
ID                              ID
Keyword (unique)                Keyword_ID
Created_at
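To make my idea concrete, here is a rough sketch of that two-table design in SQL (I'm assuming MySQL syntax here; the table and column names are just the ones from my diagram, and the ON DUPLICATE KEY trick is one way I've seen to reuse an existing row's ID):

```sql
-- Each distinct keyword is stored once.
CREATE TABLE keywords (
    ID         INT AUTO_INCREMENT PRIMARY KEY,
    Keyword    VARCHAR(255) NOT NULL UNIQUE,
    Created_at DATETIME NOT NULL
);

-- One row per time a keyword is found on a crawl.
CREATE TABLE keywords_frequency (
    ID         INT AUTO_INCREMENT PRIMARY KEY,
    Keyword_ID INT NOT NULL REFERENCES keywords (ID)
);

-- Finding 'Search' a second time: the UNIQUE constraint stops a
-- duplicate keyword row, and LAST_INSERT_ID(ID) makes the existing
-- row's ID available for the frequency insert.
INSERT INTO keywords (Keyword, Created_at)
    VALUES ('Search', NOW())
    ON DUPLICATE KEY UPDATE ID = LAST_INSERT_ID(ID);
INSERT INTO keywords_frequency (Keyword_ID) VALUES (LAST_INSERT_ID());
```

So the keyword itself would only ever be stored once, and every sighting would just add a small row to keywords_frequency.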
But I'm not sure if I'm confused and thinking about this the wrong way. I'm also concerned that analytics queries on this will become very slow because joins will be needed, whereas the simple redundant design (the original design described above) would probably be quite fast as it's one big dumb table.
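For example, the kind of analytics query I'm worried about over the two-table design would look something like this (hypothetical query, assuming the column names from my diagram):

```sql
-- How many times each keyword was found, most frequent first.
SELECT k.Keyword, COUNT(*) AS times_found
FROM keywords AS k
JOIN keywords_frequency AS f ON f.Keyword_ID = k.ID
GROUP BY k.Keyword
ORDER BY times_found DESC;
```

My (possibly wrong) understanding is that a join on an indexed integer key like this should still be fast, but I'd like confirmation before committing to the design.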
If you can understand what I am trying to achieve, could you please give me advice on how to design this?
Thank you.