2

In postgres, there is a partitioning based on hash. But postgres does not clearly explains how hashing of the given column's value is calculated.

I have searched through Postgres doc and have found nothing. Except in some mailbox posts, some people mentioned hashtext() internal function. Does anyone have any info about the actual function used for hashing of value (and further using modulus operator)? I mean how postgres hashes a value, converts it to uint64 to the final modulus operation.

Update: Reading through postgres source code, I have found partitioning functions use a method like this when they try to find an uint64 value of a given hash:

/*
 * DatumGetUInt64
 *      Returns 64-bit unsigned integer value of a datum.
 *
 * Note: this macro hides whether int64 is pass by value or by reference.
 */

#ifdef USE_FLOAT8_BYVAL
#define DatumGetUInt64(X) ((uint64) (X))
#else
#define DatumGetUInt64(X) (* ((uint64 *) DatumGetPointer(X)))
#endif
Mostafa Talebi
  • 8,825
  • 16
  • 61
  • 105

1 Answers1

3

The hash function used is the support function for the hash index operator family. You can find them in the pg_amproc system catalog; join with pg_opfamily and restrict the query to operator families for the hash access method

This query lists the standard hash support functions for each type:

SELECT DISTINCT
       ap.amproclefttype::regtype AS data_type,
       ap.amproc::regproc AS hash_function
FROM pg_amproc AS ap
   JOIN pg_opfamily AS of ON ap.amprocfamily = of.oid
   JOIN pg_am ON of.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
  AND ap.amprocnum = 1
ORDER BY amproclefttype::regtype::text;

The function should be irrelevant, but I understand your curiosity.

Note that hash partitioning is pretty useless unless you have the partitions on different storage to spread I/O load.

Laurenz Albe
  • 209,280
  • 17
  • 206
  • 263
  • Is useless? Because I need rows with the same key to be always in the same partition, something like Kafka's topic partitioning. And this does not include anything date-based or id-based logic that I can use list or range partitioning type – Mostafa Talebi Nov 02 '20 at 16:17
  • 2
    Why? if you do range or list partitioning by the key, the rows for the same key will also be in the same partition. And partitioning is not required just because a table is big. Hash partitioning will actually slow down all queries. – Laurenz Albe Nov 02 '20 at 16:21
  • So what happen if I have millions of keys? They are not predefined, they are created from various variables which result in millions of keys. My data is quite big and I would like postgres query planner to scan only one (or at most a few) partitions. Is this possible (millions of hashed keys) to be achieved through range partitioning? – Mostafa Talebi Nov 02 '20 at 16:50
  • I see, your use case is searching for the partitioning key with an "equals" condition, right? But do you really want a sequential scan for such a search, even if it is only on a single partition? As soon as you have an index on the column, the query will actually become slower with partitioning (more work for the query planner). – Laurenz Albe Nov 03 '20 at 06:17
  • So you mean I avoid partitioning for such a big big table? – Mostafa Talebi Nov 04 '20 at 11:20
  • I'm not saying that you should not partition, but that you should think how and why to partition. Typically, you want to partition so that you can get rid of old data easily. – Laurenz Albe Nov 04 '20 at 11:23
  • 4
    No, it is very useful if you find the right column to partition. In our system, we cut the update time from 40s to less than a second with the hash partition. – Tom Yeh May 10 '21 at 10:29
  • @TomYeh True, if all you need is equality search, and you don't have an index on the column. – Laurenz Albe May 10 '21 at 11:43