Why Postgresql searches Text index faster than Int index?

Question

CREATE TABLE index_test
(
  id int PRIMARY KEY NOT NULL,
  text varchar(2048) NOT NULL,
  value int NOT NULL
);
CREATE INDEX idx_index_value ON index_test ( value );
CREATE INDEX idx_index_value_and_text ON index_test ( value, text );
CREATE INDEX idx_index_text_and_value ON index_test ( text, value );
CREATE INDEX idx_index_text ON index_test ( text );

The table is populated with 10000 random rows, 'value' column has integers from 0 to 100, 'text' column has random 128 bit md5 hash. Sorry for using bad column names.

My searches are:

select * from index_test r where r.value=56;
select * from index_test r where r.value=56 and r.text='dfs';
select * from index_test r where r.text='sdf';

Anytime I make some search...

if only indexes on 'text' and/or 'value' columns are presented
if combined ('text' and 'value' together) indexes are presented

... so, anytime I see the following picture:

The search for integer column 'value' is

slower
is combined from 2 searches: *Bitmap Heap Scan on index_test* and *Bitmap Index Scan on idx_index_value*

The search for varchar column 'text' is

faster
always using an index scan

Why searching for String is easier than searching for Integer? Why the the search plans differ in that way? Is there any similar situations when this effect can be reproduced and can be helpful for developers?

score 5 · Accepted Answer · answered Mar 18 '13 at 16:37

5

As the text is a hash, unique by definition, there will be one only row in the 10k rows of the table matching that text.

The 56 value will exist about 100 times inside the 10k rows and it will be scattered all over the table. So the planner goes first to the index and find the pages where those rows are. Then it visits each of those scattered pages to retrieve the rows.

answered Mar 18 '13 at 16:37

Clodoaldo Neto

118,695
26
233
260

1

Just because a hash is unique doesn't mean that hash will be the only one in the database. For instance one use for a hash might be to check for duplicate, complex things that may be represented in your table. This negates your argument (which wasn't even a reason for a text index being faster than a numerical index). – Feb 09 '17 at 22:47
1

@MikeBethany You did not read the question or did not understand it. Try harder and then search for _cardinality_ – Clodoaldo Neto Feb 09 '17 at 23:42
I did read the question and I understand it. Your answer is factually wrong. If you disagree please point out how as I pointed out how your answer is wrong. You'll be better served using facts instead of emotion. – Feb 10 '17 at 13:38
@MikeBethany So why don't you post your own answer? The correct one? – Clodoaldo Neto Feb 10 '17 at 13:54
1

I'm much more interested in learning how I was wrong so I can improve myself. Let's work together to come up with the right answer. – Feb 10 '17 at 14:06
@MikeBethany My work is above in this answer. Now do yours. You will test your comments by building an answer. One that makes contact with the question. – Clodoaldo Neto Feb 10 '17 at 14:15
@ClodoaldoNeto - calling random 128 bit hashes doesn't guarantee their uniqueness. In fact, hashes are NOT unique, *by definition*, because there are exactly `2**128` of them. The author didn't say they went through and weeded out duplicates to ensure all 10,000 were unique. While a collision in 10,000 entries out of 128 bits is unlikely, it's by no means impossible, so why do you think they are definitively and definitionally unique? – dwanderson Jan 18 '19 at 23:01
@dwanderson, MikeBethany, can somebody :), calculate (** ) the probability, then update the answer as follows: Original: "unique by definition", Suggested: "effectively(*) unique" ..."-- * we can say this text is effectively unique, because uniqueness of random 128 bit md5 hash is exactly the same as UUID's uniqueness. The precise probability (p) of the collision is 1 - M! / M^n / (M - n)!, with n = 10,000 and M = 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 it gives p = **"? – epox Jul 21 '19 at 04:31
https://stackoverflow.com/a/1155047/2272638 - not sure if they truly did the math or not but it suggests that generating 1billion hashes (UUIDs, but whatever, 128 bits) per second for 100 years yields ~50% chance of collision. I think the math works out to 60 bits though, and my computer can't handle the math for that factorial :). To be clear, I don't think a collision is _likely_, obviously, but it is not _definitionally impossible_. – dwanderson Jul 22 '19 at 21:29

Why Postgresql searches Text index faster than Int index?

1 Answers1