17

I've come across some code in the application I'm working on that makes a database call merely to call the ORA_HASH function (documentation) on a UUID string. The reason it's doing this is that it needs the value to make a service call to another system that appears to use ORA_HASH for partitioning.

I would like to know the algorithm ORA_HASH uses so that I can re-implement it to make a similar service call for an application that won't have access to a real database, let alone Oracle. I've only been able to find what amounts to Oracle API documentation so far.

Just to be super clear: I need to clone ORA_HASH because that's what another system that's outside of my control uses, and I need to integrate with that system. Yes, it would nice if could use a really standard algorithm, like MD5, but I can't, unless that's what ORA_HASH is under the covers.

Answers or comments that propose the use of a hash algorithm besides ORA_HASH are not helpful. This question is specifically about ORA_HASH, not hashing or partitioning in general.

Kaypro II
  • 3,210
  • 8
  • 30
  • 41
  • If you want to re-implement to make a call for an application that doesn't have access to a real database, why do you need to re-implement `ORA_HASH`? What's so magic about it? Write your own simple hashing function with no reference to `ORA_HASH`. –  Aug 29 '17 at 21:05
  • The key words here are "making a service call to another system." If I use a different algorithm, I'll get a different hash value and the call won't work. – Kaypro II Aug 29 '17 at 21:06
  • Oh... I see; the "other system" that "appears to use `ORA_HASH`" is not under your control. My point was, use the same function you create in both; but if you only control one side, I see why you need this. –  Aug 29 '17 at 21:09
  • This has been asked before. See https://stackoverflow.com/questions/24375747/how-to-implement-ora-hash-seedable-hash-that-divides-any-sql-datatype-into-n-bu –  Aug 29 '17 at 21:17
  • 1
    @mathguy I don't think this is an exact duplicate, although it's debatable. To me, that other question is more about how to create something similar in concept to ORA_HASH. The exact algorithm used by ORA_HASH is not necessarily needed to answer that other question. – Jon Heller Aug 29 '17 at 22:25
  • @JonHeller - I don't mean it's an exact duplicate. But, as discussed in the other thread, Oracle never disclosed how ORA_HASH works - what algorithm it implements, how it implements it, etc. I think that answers the question in this thread, too. –  Aug 29 '17 at 23:09
  • 1
    just implement a standard hashing algo like MD5, most languages support that out of the box. So all of your "systems" can be consistent. Don't roll your own algo – tbone Aug 29 '17 at 23:40
  • So which part of the `ora_hash()` function is the issue here? Java supports MD5 which you can use for the hashing algorithm [Find out more](https://stackoverflow.com/q/415953/146325). Perhaps the trickier part will be assigning the hashed values across [an arbitrary number of buckets in a performative fashion](http://www.growingwiththeweb.com/2015/06/bucket-sort.html). – APC Aug 30 '17 at 06:17
  • Better use the "public" hash from [DBMS_CRYPTO.Hash](http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_crypto.htm#i1002022) I think by using virtual columns you can also do partitioning on such values. – Wernfried Domscheit Aug 30 '17 at 06:48
  • @mathguy This is not a duplicate. That question seems to use ORA_HASH as an example of the kind of hash algorithm it wants. This question is specifically about the algorithm ORA_HASH uses. – Kaypro II Aug 30 '17 at 18:00
  • Yes, see my reply to Jon Heller's comment. Same answer to you. –  Aug 30 '17 at 18:23
  • @tbone (and others) Suggestions to use alternative hash algorithms are not helpful. I cannot change the other system that uses ORA_HASH, and it would be totally pointless to use an incompatible algorithm. That's why I'm specifically asking about the implementation of the ORA_HASH function. – Kaypro II Aug 30 '17 at 18:26
  • 2
    https://chessprogramming.wikispaces.com/Bob+Jenkins seems like a candidate. (but burble can be modified and seeded) – wildplasser Aug 30 '17 at 18:44
  • 3
    The exact ora_hash implementation is not known as far as I'm aware. ora_hash is meant as a bucketing (bining) function, so partitioning data into x parts would make sense. However, why on earth would the value of ora_hash be required by this service to make a call??? To me, the service is broken if ora_hash is required. Fix the service. (I know, you're gonna tell me you can't touch the service ;-) – tbone Aug 30 '17 at 19:37
  • If you cannot compute it, you can always store it, just like you store SSNs, licence plate numbers, etc. – wildplasser Aug 30 '17 at 20:25
  • 2
    Not an answer but...Could you connect to an Oracle and make db execute it for you by procedure, dont you? Not the most elegant way, true, but 100% implemented. – DvTr Dec 13 '17 at 02:00
  • 1
    @DvTr I'm not sure about the OP, but for my problem I want to know the algorithm so I can recreate it in PL/SQL. The ORA_HASH function works in SQL but not in PL/SQL, and a data-masking product we use runs `select ora_hash(...) from dual` millions of times. I'd like to optimize it without changing the way it works so I want to precisely recreate ORA_HASH. – Jon Heller Dec 13 '17 at 15:42
  • if this question had a open bitcoin bounty, you might get an answer ;-) – tbone Dec 13 '17 at 17:02
  • @JonHeller not sure what you mean by "The ORA_HASH function works in SQL but not in PL/SQL". It works just fine in pl/sql. Maybe I misunderstand what you want though – tbone Dec 14 '17 at 18:35
  • @tbone `begin dbms_output.put_line(ora_hash(1,1)); end; /` raises "PLS-00201: identifier 'ORA_HASH' must be declared". A few of the SQL functions don't work natively in PL/SQL. There's `dbms_utility.get_hash_value` but it doesn't allow for a seed value. – Jon Heller Dec 14 '17 at 18:44
  • @JonHeller interesting, it works for me on 11.2 I'll post a simple function and example if that helps. – tbone Dec 14 '17 at 18:51
  • @DvTr Your suggestion is exactly what I described in the first sentence of my question, and it's also exactly what I want to **avoid** doing. I do not want to connect to an Oracle instance in any way, shape, or form to execute this function. – Kaypro II Dec 29 '17 at 14:44

2 Answers2

24

another system that appears to use ORA_HASH

Well, if it "appears to use" then it makes sense to do a bit of reverse engineering and check what exactly is called and disassemble code of the function.

If you, however, want to dive into Oracle internals then following may help.

First of all, you have to figure out what internal C function is called. To do that you can execute some long running code in one session. I did run this

select avg(ora_hash(rownum)) id from
(select rownum from dual connect by rownum <= 1e4),
(select rownum from dual connect by rownum <= 1e4);

It can be PL/SQL code as well, you just need to make sure that you constantly call ora_hash.

While it's running

I tested on Windows and looks like that ora_hash is ...->evaopn2()->evahash()->...

Now let's google for evahash. We got extremely lucky because there is a header file on official site https://oss.oracle.com/projects/ocfs-tools/src/branches/new-dir-format/libocfs/Linux/inc/ocfshash.h with link to evahash.

And finally there is page with actual C code http://burtleburtle.net/bob/hash/evahash.html

So far so good, we remember that we can use external C function in Oracle if we build it into library (DLL on Windows).

For example on my Win x64 if I change function signature to

extern "C" ub4 hash( ub1 *k, ub4 length, ub4 initval)

it can be successfully executed from Oracle. But, as you see, signature a bit differs from ora_hash in Oracle. This function accepts value, its length and initval (may be seed) while signature in Oracle is ora_hash(expr, max_bucket, seed_value).

Let's try to test Oracle

SQL> select ora_hash(utl_raw.cast_to_raw('0'), power(2, 32) - 1, 0) oh1,
  2         ora_hash('0', power(2, 32) - 1, 0) oh2,
  3         ora_hash(0, power(2, 32) - 1, 0) oh3,
  4         ora_hash(chr(0), power(2, 32) - 1, 0) oh4
  5    from dual;

       OH1        OH2        OH3        OH4
---------- ---------- ---------- ----------
3517341953 3517341953 1475158189 4056412421

C

int main()
{
    ub1 ta[] = {0};
    ub1* t = ta;
    cout << hash(t, 1, 0) << endl;
    ub1 ta0[] = {'0'};
    ub1* t0 = ta0;
    cout << hash(t0, 1, 0) << endl;
    return 0;
}

1843378377
4052366646

None of the numbers matches. So what is the problem? ora_hash accepts parameters of almost any type (for example select ora_hash(sys.odcinumberlist(1,2,3)) from dual) while C function accepts value as array of bytes. This means that some conversion happens before function call. Thus before using mentioned C hash function you have to figure out how actual value is transformed before passing to it.

You can proceed with reverse engineering of Oracle binaries using IDA PRO + hex rays but that may take days. Not to mention platform specific details.

So if you want to imitate ora_hash, the easiest option would be to install Oracle express edition and use it to call ora_hash.

I hope that was interesting. Good luck.

Update

ora_hash and dbms_utility.get_hash_value can be mapped to each other (see https://jonathanlewis.wordpress.com/2009/11/21/ora_hash-function/)

SQL> select dbms_utility.get_hash_value('0', 0 + 1, 1e6 + 1) ha1,
  2         ora_hash('0', 1e6, 0) + 1 ha2
  3    from dual;

       HA1        HA2
---------- ----------
    338437     338437

If we unwrap package body of dbms_utility we will see following declaration

  function get_hash_value(name varchar2, base number, hash_size number)
    return number is
  begin
    return(icd_hash(name, base, hash_size));
  end;

and

  function icd_hash(name      varchar2,
                    base      binary_integer,
                    hash_size binary_integer) return binary_integer;
  pragma interface(c, icd_hash);

Let's google for icd_hash and we can find that it's mapped to _psdhsh (https://yurichev.com/blog/50/). Now it's time to disassemble oracle.exe and extract code for _psdhsh from it. Maybe I'll spend some time on this next year.

Dr Y Wit
  • 2,000
  • 9
  • 16
  • I believe I could use this to build an answer slightly more quickly. There has to be a format you could pass in that doesn't get transformed much. – Dylan Brams Dec 19 '17 at 19:42
  • @DylanB Go for it. If it takes longer than the bounty period, don't worry, I'll create another 500 point bounty and award it to you or anyone else who builds it first. (Assuming the code is either in PL/SQL or callable from PL/SQL and matches ORA_HASH.) – Jon Heller Dec 20 '17 at 01:10
  • @Dylan B, I believe it would be enough if you figure out at least how it works for varchar2 parameters. I played with one-byte values to make it as simple as possible and to avoid issues with big-endian/little-endian and other stuff. Oracle may add one "service" byte to the value or reverse bits for all bytes of whatever. It may be something more complex and it's extremely hard to figure it out without reverse engineering of Oracle binaries. – Dr Y Wit Dec 21 '17 at 11:39
  • @Jon Heller, well, reverse engineering is more for C/assembler developers. So you may want to add those tags to attract audience. Although I think that it'd easier to contact this person http://burtleburtle.net/bob/index.html. Please let me know if he provides an answer regarding transformation into array of "unsigned char". – Dr Y Wit Dec 21 '17 at 11:47
  • While you didn't fully answer my question, your answer is nonetheless extremely impressive to this lowly enterprise developer and also a large step in the right direction. Thank you. – Kaypro II Dec 29 '17 at 05:58
1

This doesn't answer the OP question of the actual algo behind ora_hash. This is just an example of using ora_hash in pl/sql (answering @JonHeller comment):

The function:

SQL> create or replace function get_ora_hash(i_str in varchar2, i_max_bucket in number default 4294967295, i_seed number default 0)
return number deterministic
parallel_enable
as
  rv number:= 0;
begin

select ORA_HASH(i_str, i_max_bucket, i_seed) 
into rv 
from dual;

return rv;

end;
Function created.

And using it:

SQL> declare
  l_val number;
begin
  l_val := get_ora_hash('test');
  dbms_output.put_line(l_val);
end;
 PL/SQL procedure successfully completed.

Dbms Output:

2662839991

You can also mess around with RESULT_CACHE or other techniques to try to speed things up even more.

Its very fast already. For example, calling the function 1 million times on a large table:

SQL> set serveroutput on
SQL> declare
  l_val number;
  l_start_dte timestamp;
  l_end_dte timestamp;
  l_interval INTERVAL DAY(9) TO SECOND(9);
  l_cnt number := 0;
begin
  l_start_dte:= systimestamp;
  --for rec in (select object_name from dba_objects)
  for rec in (select name from my_big_table where rownum <= 1000000)
  loop
    l_cnt := l_cnt + 1;
    l_val := get_ora_hash(rec.name);
  end loop;
  l_end_dte:= systimestamp;
  l_interval := l_end_dte - l_start_dte;
  dbms_output.put_line('Rows processed: ' || l_cnt 
    || ', Start: ' || l_start_dte  
    || ', End: ' || l_end_dte 
    || ', Interval: ' || l_interval);
end;
Rows processed: 1000000, Start: 14-DEC-17 02.48.31.138212 PM, End: 14-DEC-17 02.48.41.148884 PM, Interval: +000000000 00:00:10.010672000
 PL/SQL procedure successfully completed.

So basically 100k rows per second, that includes any context switches you may be worried about.

If you need to reproduce ORA_HASH because of performance, I would suggest that your performance bottleneck may be elsewhere.

tbone
  • 15,107
  • 3
  • 33
  • 40
  • The masking product I'm using already does this. The context switches from the "select ora_hash ... from dual" are horribly slow. It would be much faster if the code could execute `return ora_hash...` but that is not allowed. Which is why knowing the ORA_HASH algorithm could be useful as we could potentially recreate it in pure PL/SQL. – Jon Heller Dec 14 '17 at 19:33
  • @JonHeller I updated my answer. Its very fast actually. – tbone Dec 14 '17 at 19:53
  • Unfortunately our data masking product calls ORA_HASH multiple times for every column in every row, and we have billions of rows to process. So an alternative implementation would still be helpful. – Jon Heller Dec 14 '17 at 20:47
  • @JonHeller I would argue that even if you found an alternative implementation that yields the same results, it wouldn't be any faster. Post your data masking implementation if you can, my bet is that your slowness is due to something else in this process. – tbone Dec 14 '17 at 21:07
  • 2
    I feel this should be broken off into another question (i.e. "How can I call ora_hash from PL/SQL?"). – Kaypro II Dec 29 '17 at 05:59
  • @KayproII maybe so, but JonHellers question was really about concerns over performance of ora_hash, which may be why some would want to know the exact algo so they could implement another way. I realize others just want it so they can avoid Oracle (and license issues) entirely. – tbone Dec 29 '17 at 13:00