-1

Which of these methods would be the most efficient way of storing, retrieving, processing and searching a large (millions of records) index of stored URLs along with there keywords.

Example 1: (Using one table)

TABLE_URLs-----------------------------------------------
ID        DOMAIN        KEYWORDS
1         mysite.com    videos,photos,images
2         yoursite.com  videos,games
3         hissite.com   games,images
4         hersite.com   photos,pictures
---------------------------------------------------------

Example 2: (one-to-one Relationship from one table to another)

TABLE_URLs-----------------------------------------------
ID        DOMAIN        KEYWORDS
1         mysite.com
2         yoursite.com 
3         hissite.com
4         hersite.com
---------------------------------------------------------

TABLE_URL_KEYWORDS---------------------------------------------
ID        DOMAIN_ID     KEYWORDS
1         1             videos,photos,images
2         2             videos,games
3         3             games,images
4         4             photos,pictures
---------------------------------------------------------

Example 3: (one-to-one Relationship from one table to another (Using a reference table))

TABLE_URLs-----------------------------------------------
ID        DOMAIN
1         mysite.com
2         yoursite.com
3         hissite.com
4         hersite.com
---------------------------------------------------------

TABLE_URL_TO_KEYWORDS------------------------------------
ID        DOMAIN_ID     KEYWORDS_ID
1         1             1
2         2             2
3         3             3
4         4             4
---------------------------------------------------------

TABLE_KEYWORDS-------------------------------------------
ID        KEYWORDS
1         videos,photos,images
2         videos,games
3         games,images
4         photos,pictures
---------------------------------------------------------

Example 4: (many-to-many Relationship from url to keyword ID (using reference table))

TABLE_URLs-----------------------------------------------
ID        DOMAIN
1         mysite.com
2         yoursite.com
3         hissite.com
4         hersite.com
---------------------------------------------------------

TABLE_URL_TO_KEYWORDS------------------------------------
ID        DOMAIN_ID     KEYWORDS_ID
1         1             1
2         1             2
3         1             3
4         2             1
5         2             4
6         3             4
7         3             3
8         4             2
9         4             5
---------------------------------------------------------

TABLE_KEYWORDS-------------------------------------------
ID        KEYWORDS
1         videos
2         photos
3         images
4         games
5         pictures
---------------------------------------------------------

My understanding is that Example 1 would take the largest amount of storage space however searching through this data would be quick (Repeat keywords saved multiple times, however keywords are sat next to the relevant domain)

wWhereas Example 4 would save a tons on storage space but searching through would take longer. (Not having to store duplicate keywords, however referencing multiple keywords for each domain would take longer)

Could anyone give me any insight or thoughts on which the best method would be to utilise when designing a database that can handle huge amounts of data? With the foresight that you may want to display a URL with its assosicated keywords OR search for one or more keywords and bring up the most relevant URLs

GMB
  • 216,147
  • 25
  • 84
  • 135
PaulF
  • 124
  • 1
  • 8
  • Please read https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad and don't store comma separated data in your database – nbk Apr 20 '20 at 21:54

2 Answers2

2

You do have a many-to-many relationship between url and keywords. The canonical way to represent this in a relational database is to use a bridge table, which corresponds to example 4 in your question.

Using the proper data structure, you will find out that the queries will be much easier to write, and as efficient as it gets.

I don't know what drives you to think that searchin in a structure like the first one will be faster. This requires you to do pattern matching when searching for each single keyword, which is notably slow. On the other hand, using a junction table lets you search for exact matches, which can take advantage of indexes.

Finally, maintaining such a structure is also much easier; adding or removing keywords can be done with insert and delete statements, while other structures require you do do string manipulation in delimited list, which again is tedious, error-prone and inefficient.

GMB
  • 216,147
  • 25
  • 84
  • 135
0

None of the above.

Simply have a table with 2 string columns:

CREATE TABLE domain_keywords (
    domain VARCHAR(..) NOT NULL,
    keyword VARCHAR(..) NOT NULL,
    PRIMARY KEY(domain, keyword),
    INDEX(keyword, domain)
) ENGINE=InnoDB

Notes:

  • It will be faster.
  • It will be easier to write code.
  • Having a plain id is very much a waste.
  • Normalizing the domain and keyword buys little space savings, but at a big loss in efficiency.

"Huse database"? I predict that this table will be smaller than your Domains table. That is, this table is not your main concern for "huge".

Rick James
  • 135,179
  • 13
  • 127
  • 222