5

I understand why the index order matters in Rails (from answers like these), for example, if I have:

add_index :admin_users_pages, ["user_id", "page_id"]

So I'm supposed to put the field that "narrows down the number of rows" fastest, but I'm not sure what does that mean. Say I have 2 users, with 2 unique IDs, and 300 pages, with 300 unique IDs, which one would be a smarter choice to put first? Say I have 150 pages for the first user and 150 pages for the second user, would the index look something like:

user_id page_id
1       1
1       2  
1       3

or the page_id won't be sorted at all, only the index so I should get something like:

user_id page_id
1       143
1       93  
1       31
Community
  • 1
  • 1
daremkd
  • 8,244
  • 6
  • 40
  • 66
  • Note that in the question you reference there is the statement, "The user_views table will always be queried to look for both columns, never only one". That sounds very dubious to me -- there will never be a "find all articles that this user has viewed", or "find all users who have viewed this article" query? – David Aldridge Aug 13 '15 at 13:27

2 Answers2

3

If for a given user you want to find its pages, use [:user_id, :page_id].

If for a given page you want to find its users, use [:page_id, :user_id].

If you want to do both, then create [:user_id, :page_id] and [:page_id, :user_id].

If you have a user_id and a page_id and you want to find that row (not a very likely situation, IMHO), then for a balanced tree index it doesn't matter which order you have chosen. The entries are sorted within the index for both the first and the second and subsequent columns.

In some situations it is arguable that the least selective should go first (for Oracle compressed b-tree indexes or for Oracle skip-scan access), but in general it really doesn't matter.

David Aldridge
  • 51,479
  • 8
  • 68
  • 96
  • Do [:user_id, :page_id] and [:page_id, :user_id] have the same b-tree depth? – lx00st Aug 13 '15 at 12:15
  • I don't think it makes any practical difference, as the depth is really driven by the number of entries per node, which is not affected by the order of the entries. The choice of indexes should be determined by the access paths to the data. – David Aldridge Aug 13 '15 at 13:21
  • This is an index for a join table with only these 2 keys as columns. So basically it's okay to have 2 indexes with same columns in a different order? – daremkd Aug 14 '15 at 10:09
  • Yes, because although it increases the overhead on making modifications to the table, that penalty is only paid once per modification. However the benefit, which is that the query can use the index without needing to touch the table, is received on every query of the data. The ratio of reads to writes is measurable on some systems (Oracle, through segment statistics for example), but you'd probably expect reads to exceed writes by *at least* an order of magnitude. – David Aldridge Aug 14 '15 at 10:13
1

In your case the selectivity of page_id will be better, because it narrows down number of rows extremely fast (down to 2). It means that if you are given a page_id than you can take 2 records from from table and then filter them by user_id, but if you have user_id then you will take 150 records and filter them. So it is better to put 'page_id' first.

lx00st
  • 1,566
  • 9
  • 17
  • You're talking about two different types of query here though -- *"if you are given a page_id than you can take 2 records from from table and then filter them by user_id"* -- one based on knowing the page_id, and one based on knowing the user_id. If that is the case then two separate indexes should be used. – David Aldridge Aug 12 '15 at 15:48
  • Its not about querying. Im trying to explain what "narrows down the rows" means – lx00st Aug 13 '15 at 12:12
  • But your talking about how many rows from the table will have to to be read and then filtered based on whether you know the page_id or the user_id -- those are different queries, and the question of having to access the table to filter out rows does not apply when both page_id and user_id are included in the composite index. – David Aldridge Aug 13 '15 at 13:30
  • You are right, this question is driven more by the way one query his data. What Im explaining is that if you filter `user_id` first than you have more tree traversing and less leaf nodes search which is good) – lx00st Aug 13 '15 at 13:45
  • What if the reverse was true (there were 300 users and 2 pages)? Will then creating another index with the columns reversed work? Is this a good idea? – daremkd Aug 14 '15 at 10:11
  • As @David Aldridge has said the way you index your data should reflect the way you query your data but you can optimize your queries (and indexes resp.) according to some predictions about your data. So if you have `2 users and 300 pages` then you should try to put `page_id` constraint first in your queries(and indexes) because that column has better statistics, better distribution (its commonly called 'selectivity'). The same goes for ` 300 users and 2 pages` (user_id is more 'selective') – lx00st Aug 14 '15 at 10:38
  • I'm honestly not sure that it makes that much difference to the efficiency of the query, especially if the values you are querying for are guaranteed to exist in the table. If there was a chance of finding out that there were no values at all for page_id = 345 even if there are values for user_id = 123, then I'd say that an index on (page_id,user_id) makes more sense than (user_id,page_id). That aside, I would urge testing on real-world data to verify whether one method is significantly faster than the other. – David Aldridge Aug 14 '15 at 10:51
  • Efficiency is not drastic but it exists. Index Is a tree with linked leafs. Scanning a tree is very fast, scanning leafs - not so much fast. That's efficiency) – lx00st Aug 14 '15 at 11:15
  • Actually have just tested it on postgres with 1000 rows and there is no considerable difference) Which means that if you have query planner that can choose a column with better statistics than it is really no matter) – lx00st Aug 14 '15 at 12:03