
I have a table:

URL | last_crawled | worker_id | worker_assign_date

The URL is obviously any URL. last_crawled is a date field that stores when the URL was last crawled. worker_id is an ID for the worker currently assigned to this URL, and worker_assign_date is when the URL was assigned to a crawler. That way, if worker X takes too long to crawl a URL, I can just reassign the URL to another worker.

My question is, seeing as I will be sorting by worker_assign_date a lot and by last_crawled a lot, what type should those columns be?

First priority is speed, which sorts faster? Int or DateTime?

Second priority is size. Does an int take up less space than DateTime or more?

Please note: the internet has roughly 5 billion pages. This database is to hold all 5 billion URLs and handle updates etc. I will use InnoDB so I can lock single rows only.
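For reference, here is a minimal sketch of the table described above. The table name `urls` and the key lengths are illustrative, and the types of the two date columns are exactly the open question (INT UNSIGNED vs DATETIME):

```sql
-- Hypothetical schema for the crawl queue. The two "date" columns
-- are shown as INT UNSIGNED here, but DATETIME is the alternative.
CREATE TABLE urls (
    url                 VARBINARY(767) NOT NULL,
    last_crawled        INT UNSIGNED   NULL,  -- or DATETIME
    worker_id           INT UNSIGNED   NULL,
    worker_assign_date  INT UNSIGNED   NULL,  -- or DATETIME
    PRIMARY KEY (url),
    KEY idx_last_crawled (last_crawled),
    KEY idx_worker_assign_date (worker_assign_date)
) ENGINE=InnoDB;
```

Both sort columns are indexed, since the whole point is to `ORDER BY` them cheaply.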

UPDATE

A Unix timestamp that will "work" until 2099 would be 4099770061, so storing it as INT(10) UNSIGNED would be more than sufficient (UNSIGNED because 4099770061 exceeds the signed INT maximum of 2147483647; the (10) is only a display width and does not affect storage). According to the MySQL documentation, that takes up 4 bytes. A DATETIME field takes up 8 bytes (5 bytes as of MySQL 5.6.4). So it seems that timestamps are at least smaller. Is that true?
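A sketch of what that choice looks like in practice (the `urls` table name is an assumption; note the UNSIGNED, since 4099770061 is larger than the signed INT maximum of 2147483647):

```sql
-- 4-byte epoch-seconds columns; UNSIGNED because values past
-- year 2038 exceed the signed INT range.
ALTER TABLE urls
    MODIFY last_crawled       INT UNSIGNED,
    MODIFY worker_assign_date INT UNSIGNED;

-- The application sorts on the raw integer:
SELECT url
FROM urls
ORDER BY last_crawled ASC
LIMIT 100;
```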

Also, the last question then remains, which is faster during a sort? Is there any difference?

rockstardev
  • Hey @coderama, I'd highly recommend datetime. The DBAs at my company recommend it and we maintain a large MySQL stack. Also, please refer to this Stack Overflow question: http://stackoverflow.com/questions/12617347/when-do-we-chose-datetime-over-timestamp – Mark Madej Nov 07 '14 at 05:51
  • Thanks, but on what grounds? I won't be using any date functions. It is PURELY for ordering. I have established that datetime is bigger, but is it slower or faster? And if I don't need all the extra MySQL functions, why would I use the bigger and slower of the two? – rockstardev Nov 07 '14 at 06:10
  • If it's only for ordering, I agree int would be better. However, you may also want to consider what your database will be used for in the future. You may not need date functions now, but inevitably your data store will grow bigger and you'll want to report on that data. It will be a lot easier if you can leverage MySQL's date functions in that case. – Mark Madej Nov 07 '14 at 06:12
  • I think that's especially true if you're building some kind of web crawler. One database row for one URL is going to add up FAST. – Mark Madej Nov 07 '14 at 06:13
  • Awesome. Any idea which will sort faster? I think the int, because it is smaller, but perhaps there is some benefit to datetimes I'm not aware of? – rockstardev Nov 07 '14 at 06:24

2 Answers


I found a thread that addresses your exact question: int is significantly faster. It includes performance tests whose results come out hugely in favor of int over datetime.

MySQL Integer vs DateTime index

Mark Madej

Partitions?

If this is one of those edge cases where you put a date in a database and don't want to do anything with it until later, then I would consider partitioning by date, month, day, year, etc.

http://dev.mysql.com/doc/refman/5.6/en/partitioning-overview.html
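As a sketch of the idea, RANGE partitioning on an integer last_crawled column might look like this (the `urls` table name and the partition boundaries are illustrative; MySQL requires the partitioning column to be part of every unique key, hence the composite primary key):

```sql
-- Monthly RANGE partitions over an epoch-seconds column.
-- Boundary values are illustrative Unix timestamps.
CREATE TABLE urls (
    url          VARBINARY(767) NOT NULL,
    last_crawled INT UNSIGNED   NOT NULL,
    PRIMARY KEY (url, last_crawled)
) ENGINE=InnoDB
PARTITION BY RANGE (last_crawled) (
    PARTITION p1 VALUES LESS THAN (1412121600),  -- before Oct 2014
    PARTITION p2 VALUES LESS THAN (1414800000),  -- before Nov 2014
    PARTITION p3 VALUES LESS THAN MAXVALUE
);
```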

Integer or Datetime?

  • an integer is a more primitive type and doesn't use as many bytes.
  • on a 32-bit machine, comparing 4 bytes is faster than comparing 8 bytes.
  • keep in mind that resolution is important,
    • DATE is 3 bytes
    • YEAR alone is only 1 byte

Do your homework

Based on your workload (write-intensive? read-intensive?) and where you do the conversion from timestamp to integer, you may have some surprises. I wouldn't trust numbers I see out there; I would conduct my own experiment to make sure it fits my workload.
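A minimal way to run that experiment yourself, assuming two test tables `t_int` (INT UNSIGNED column `ts`) and `t_dt` (DATETIME column `ts`) loaded with the same data:

```sql
-- Compare sort cost on the two candidate types over identical data.
SET profiling = 1;

SELECT url FROM t_int ORDER BY ts ASC LIMIT 1000;
SELECT url FROM t_dt  ORDER BY ts ASC LIMIT 1000;

-- Durations of the two queries, side by side:
SHOW PROFILES;
```

Run each query several times to warm the buffer pool before comparing numbers.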

http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html
http://dev.mysql.com/doc/internals/en/date-and-time-data-type-representation.html


If your end goal is to keep crawling the page you last crawled the longest time ago, you could imagine a ring where every row has a number that increases on a regular basis; when a URL gets crawled again, you reset that number to zero.
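One way to sketch that ring idea in SQL (the `age` counter column and the `urls` table name are assumptions, not part of the original schema):

```sql
-- Periodic job: everything not crawled this round gets older.
UPDATE urls SET age = age + 1;

-- When a URL is crawled, reset it to the "freshest" position:
UPDATE urls SET age = 0 WHERE url = 'http://example.com/';

-- Oldest first, no date arithmetic involved:
SELECT url FROM urls ORDER BY age DESC LIMIT 100;
```

The trade-off is that the periodic UPDATE touches every row, so this only pays off if that sweep is cheaper than the sorts it replaces.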

Partitions, cont'd.

Say you used partitions and you know partition p1 has the oldest rows:

SELECT URL FROM mytable PARTITION (p1);
# crawl those.

Next time, imagine p2 has the oldest data.

SELECT URL FROM mytable PARTITION (p2);
# crawl those.

Boom, no order or sort needed.

How to choose a good partitioning scheme.

5 billion pages is a lot. You are not going to crawl that in one day (or I hope you are not using dial-up). If your moving window is a week, then there are roughly 52 weeks in a year; each week you get a new working set.

http://dev.mysql.com/doc/refman/5.6/en/partitioning-selection.html
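Under that scheme, a week-keyed table could be sketched like this (the `crawl_week` column, maintained by the application, and all names are illustrative):

```sql
-- One partition per week of the moving window; crawl one per week.
CREATE TABLE urls (
    url        VARBINARY(767)   NOT NULL,
    crawl_week TINYINT UNSIGNED NOT NULL,  -- 1..52, set by the app
    PRIMARY KEY (url, crawl_week)
) ENGINE=InnoDB
PARTITION BY RANGE (crawl_week) (
    PARTITION w01 VALUES LESS THAN (2),
    PARTITION w02 VALUES LESS THAN (3),
    -- ... one partition per remaining week ...
    PARTITION w52 VALUES LESS THAN MAXVALUE
);

-- This week's working set, no ORDER BY required:
SELECT url FROM urls PARTITION (w01);
```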

dnozay
  • What benefits would that have for me? Is it better to have a few smaller int fields rather than one big field? Remember, only sorting! Will this not make it even harder, as fields would have to be combined? – rockstardev Nov 07 '14 at 06:42
  • I suspect XY problem. Is the end goal to sort or to act on a given subset (partition?) of data based on some criteria (e.g. more than one month old)? – dnozay Nov 07 '14 at 06:44
  • The idea is to "sort by lastcrawled asc" so that I can get the URL that was crawled the longest time ago. Make sense? I'm open to any suggestion to make sure that the oldest records get ordered first. a last crawled date field just seemed to make sense. – rockstardev Nov 07 '14 at 06:46
  • I understand that part, but is there a purpose to order them that way? – dnozay Nov 07 '14 at 07:05
  • No. It only determines crawling order. There is a database that only stores links (5 billion rows in a MySQL database). Once a URL is crawled, the result is passed to a Solr database for indexing. So there are two indexes for the data. – rockstardev Nov 07 '14 at 07:07
  • do you ever crawl the same page again? do you update its timestamp then? – dnozay Nov 07 '14 at 07:14
  • Yes. Every time a page is crawled again, the timestamp is updated to when it was last crawled (i.e. now). So each row entry would age over time. Perhaps there is an inherent "row updated" field part of mysql that would be even faster? – rockstardev Nov 07 '14 at 07:35