12

I need to store potentially hundreds of millions of URLs in a database. Every URL should be unique, so I will use ON DUPLICATE KEY UPDATE and count the duplicate URLs.

However, I am not able to create an index on the URL field, as my VARCHAR field is 400 characters and MySQL complains: "#1071 - Specified key was too long; max key length is 767 bytes". (A VARCHAR(400) in utf8 takes 1200 bytes.)

What is the best way to do this if you need to process a minimum of 500,000 URLs per day on a single server?

We are already thinking of using MongoDB for the same application, so we could simply query MongoDB to find the duplicate URL and update the row. However, I am not in favor of solving this problem with MongoDB; I'd like to use just MySQL at this stage, as I want to be as lean as possible in the beginning and finish this section of the project much faster. (We haven't played with MongoDB yet and don't want to spend time on it at this stage.)

Is there any other way to do this using fewer resources and less time? I was thinking of getting the MD5 hash of each URL and storing it as well, and making that field UNIQUE instead. I know there will be collisions, but it is OK to have 5-10-20 duplicates among 100 million URLs, if that's the only problem.

Do you have any suggestions? I also don't want to spend 10 seconds inserting just one URL, as it will process 500k URLs per day.

What would you suggest?

Edit: As requested, this is the table definition. (I am not using MD5 at the moment; it is there for testing.)

mysql> DESC url;
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
| Field       | Type                  | Null | Key | Default           | Extra                       |
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
| url_id      | int(11) unsigned      | NO   | PRI | NULL              | auto_increment              |
| url_text    | varchar(400)          | NO   |     |                   |                             |
| md5         | varchar(32)           | NO   | UNI |                   |                             |
| insert_date | timestamp             | NO   |     | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| count       | mediumint(9) unsigned | NO   |     | 0                 |                             |
+-------------+-----------------------+------+-----+-------------------+-----------------------------+
5 rows in set (0.00 sec)
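
For reference, a minimal sketch of the insert flow described above, assuming the table definition shown and a UNIQUE index on the md5 column (the digest is computed with MySQL's built-in MD5() function; the literal URL is just an example):

-- The UNIQUE index on `md5` turns a duplicate URL into an UPDATE
-- that bumps the counter instead of inserting a second row.
INSERT INTO url (url_text, md5, `count`)
VALUES ('http://example.com/some/path',
        MD5('http://example.com/some/path'), 1)
ON DUPLICATE KEY UPDATE `count` = `count` + 1;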

5 Answers

10

According to the DNS spec, the maximum length of a domain name is:

The DNS itself places only one restriction on the particular labels
that can be used to identify resource records. That one restriction
relates to the length of the label and the full name. The length of
any one label is limited to between 1 and 63 octets. A full domain
name is limited to 255 octets (including the separators).

255 * 3 = 765 < 767 (just barely :-) ), at utf8's 3 bytes per character.

However, notice that each component can only be 63 characters long.

So I would suggest chopping the URL into its component parts (a schema sketch follows the advantages list below).

Using http://foo.example.com/a/really/long/path?with=lots&of=query&parameters=that&goes=on&forever&and=ever

Probably this would be adequate:

  • protocol flag ["http" -> 0] (store "http" as 0, "https" as 1, etc.)
  • subdomain ["foo"] (255 - 63 = 192 characters; I could subtract 2 more because the minimum TLD is 2 characters)
  • domain ["example"] (63 characters)
  • tld ["com"] (4 characters to handle the "info" TLD)
  • path ["a/really/long/path"] (as long as you want - store in a separate table)
  • query parameters ["with=lots&of=query&parameters=that&goes=on&forever&and=ever"] (store in a separate key/value table)
  • port number / authentication stuff that is rarely used can go in a separate keyed table if actually needed.

This gives you some nice advantages:

  • The index covers only the parts of the URL that you need to search on (smaller index!)
  • Queries can be limited to particular URL parts (find every URL in the facebook domain, for example)
  • Any URL with too long a subdomain/domain is bogus
  • Easy to discard query parameters
  • Easy to do case-insensitive domain name/TLD searching
  • Discards the syntactic sugar ("://" after the protocol, "." between subdomain/domain and domain/tld, "/" between tld and path, "?" before the query, "&" and "=" in the query)
  • Avoids the major sparse-table problem. Most URLs will not have query parameters or long paths. If those fields are in a separate table, your main table does not take the size hit; more records fit into memory when querying, and query performance is therefore faster.
  • (more advantages here).
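
A minimal sketch of what that decomposition might look like (all table and column names here are illustrative, not prescribed by the answer; the column widths follow the list above, and the key/value table keeps parameter order, per the comment below):

-- Host components only; the ascii charset keeps the composite key
-- at 63 + 4 + 192 bytes, comfortably under the 767-byte limit.
CREATE TABLE url_host (
  url_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  protocol  TINYINT UNSIGNED NOT NULL,       -- 0 = http, 1 = https, ...
  subdomain VARCHAR(192) NOT NULL DEFAULT '',
  domain    VARCHAR(63)  NOT NULL,
  tld       VARCHAR(4)   NOT NULL,
  KEY idx_host (domain, tld, subdomain)
) DEFAULT CHARSET = ascii;

-- Long, rarely searched parts live in side tables keyed by url_id,
-- so the main table stays small and cache-friendly.
CREATE TABLE url_path (
  url_id INT UNSIGNED NOT NULL PRIMARY KEY,
  path   TEXT NOT NULL
) DEFAULT CHARSET = ascii;

CREATE TABLE url_query_param (
  url_id      INT UNSIGNED NOT NULL,
  param_order SMALLINT UNSIGNED NOT NULL,    -- preserve parameter order
  param_name  VARCHAR(255) NOT NULL,
  param_value VARCHAR(255) NOT NULL,
  PRIMARY KEY (url_id, param_order)
) DEFAULT CHARSET = ascii;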
Pat
  • 5,761
  • 5
  • 34
  • 50
  • Note that the `4 characters` to handle TLDs is not good practice. From http://stackoverflow.com/questions/9238640/how-long-can-a-tld-possibly-be a TLD could be up to 63 chars; the current largest is 24. – Eborbob Aug 18 '15 at 11:53
  • Also, storing the `?key=value` params in a key/value store could be a bad idea, since sometimes the *order* of params matters to a website. When you want to pull the full URL back out of the database, you might get a different result when you put the URL back together. – Xeoncross Nov 14 '16 at 17:48
0

The odds of a spurious collision with MD5 (128 bits) can be phrased this way:

"If you have 9 Trillion different items, there is only one chance in 9 Trillion that two of them have the same MD5."

To phrase it another way: you are more likely to be hit by a meteor while winning the mega-lottery.
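
As a sanity check, the standard birthday approximation for 128-bit hashes puts the collision probability at about n^2 / 2^129, which for n = 9 trillion does work out to roughly one chance in 8-9 trillion. You can evaluate it directly in MySQL (a back-of-the-envelope check, not part of the original answer):

-- Birthday bound: P(collision among n random 128-bit values) ~= n^2 / 2^129.
SELECT POW(9E12, 2) / POW(2, 129) AS approx_collision_probability;
-- returns ~1.2e-13, i.e. about 1 chance in 8 trillion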

Rick James
  • I think that's incorrect: http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem – endolith Aug 02 '14 at 00:44
0

To index a field up to 767 characters wide, its charset must be ASCII or similar; it can't be utf8, because utf8 uses 3 bytes per character, so the maximum width for an indexed utf8 field is 255 characters.

Of course, a 767-character ASCII url field exceeds your initial 400-character spec, and some URLs exceed the 767 limit anyway. Perhaps you can store and index the first 735 characters plus the MD5 hash (735 + 32 = 767). You can also keep a TEXT full_url field to preserve the original value.
Notice that the ASCII charset is good enough for URLs.
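
A minimal sketch of that layout, assuming an ASCII charset throughout (the column names url_prefix and full_url are illustrative):

-- 735 ASCII bytes of prefix plus the 32-char hex MD5 is exactly the
-- 767-byte key limit; full_url preserves the untruncated original.
CREATE TABLE url (
  url_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  url_prefix VARCHAR(735) CHARACTER SET ascii NOT NULL,
  md5        CHAR(32)     CHARACTER SET ascii NOT NULL,
  full_url   TEXT         NOT NULL,
  UNIQUE KEY uniq_url (url_prefix, md5)
);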

Saic Siquot
  • See a good alternative method [here](http://stackoverflow.com/questions/5147867/best-way-to-store-url-in-mysql-for-a-readwrite-intensive-application/5148006#5148006), posted by @razzed. – Saic Siquot Jul 27 '11 at 21:16
0

A well-formed URL can only contain characters within the ASCII range; other characters need to be encoded. So, assuming the URLs you intend to store are well formed (and if they are not, you may want to fix them prior to inserting them into the database), you could define your url_text column's character set as ASCII (latin1 in MySQL). With ASCII, one character is one byte, and you will be able to index the whole 400 characters as you want.
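
In terms of the question's table, that could be as simple as the following sketch (latin1 per the answer; 400 single-byte characters sit well under the 767-byte key limit):

-- One byte per character in latin1, so the whole column is indexable.
ALTER TABLE url
  MODIFY url_text VARCHAR(400) CHARACTER SET latin1 NOT NULL,
  ADD UNIQUE KEY uniq_url_text (url_text);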

sagi
-1

You can change url_text from VARCHAR(400) to TEXT; then you can add a full-text index against it, allowing you to search for the existence of the URL before you insert it.

woot586
  • Some URLs can be up to 2083 characters long, e.g. in IE8. In normal practice URLs would not be this long, but you should be prepared for them. See the discussion at http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-a-url – woot586 Jul 23 '11 at 14:57
  • I meant not the length of the URL, but the usefulness of a fulltext index for URLs :) – Karolis Jul 23 '11 at 15:04
  • Fulltext for URLs is nonsense, because you don't search full text, you compare whole strings. And that's why he has the MD5 sum there: to avoid inserting the same URL twice. – Tomas Jul 23 '11 at 15:23
  • The fulltext index will not enforce uniqueness, which is the specific requirement of the index here... – Brendan Bullen Jul 25 '11 at 12:48
  • As per the comment, that's why you search against it before you do an insert. – woot586 Jul 25 '11 at 14:17
  • TEXT is stored separately from the rest of the row (more disk I/O). TEXT is for unstructured data. A URL is structured, and that structure can be leveraged. – Pat Oct 19 '12 at 18:13