Is there any way to implement hash tables efficiently in a purely functional language? It seems like any change to the hash table would require creating a copy of the original hash table. I must be missing something. Hash tables are pretty darn important data structures, and a programming language would be limited without them.
- Hash tables are just one way to implement associative arrays. The latter do exist in purely functional languages. – Jul 22 '11 at 16:45
- What you're missing is that you severely overestimate the importance of hash tables. Specific data structures don't matter; their performance characteristics do. – C. A. McCann Jul 22 '11 at 17:42
- @camccann: Name another data structure that has O(1) random insertion and lookup. – Matt Fichman Jul 24 '11 at 04:07
- You'll need to specify what you mean by O(1) insertion/lookup in this case. In the strictest sense, hash tables aren't O(1) either. – C. A. McCann Jul 24 '11 at 04:16
- Hash tables have O(1) (i.e., constant) amortized lookup, if properly sized; that is, the lookup time is independent of the size of the table. – Matt Fichman Jul 28 '11 at 21:51
- Actually, the worst case is not O(1), since you can get very unlucky with your hash function. It's only O(1) guaranteed if you have a perfect hash function and a table size proportional to N. – Phob Aug 09 '11 at 18:26
- @C.A.McCann "What you're missing is that you severely overestimate the importance of hash tables". No, he doesn't. Hash tables are by far the fastest dictionary in most practical cases. The only notable exception is arrays, when the keys are integers within a small enough range. – J D Jun 05 '12 at 10:21
- Hash tables are good for lots of things, but don't really fit the functional paradigm so well. Try finger trees, or see Okasaki '96. The key point is that it's far from true that any change to a data structure means the whole thing has to be copied: persistent data structures (ones whose old states must remain available, which is the functional property) are carefully organized so that modifications can be made with very little copying. – Nicholas Wilson Jun 05 '12 at 15:11
- @Jon Harrop: You also severely overestimate the importance of hash tables, and have completely failed to justify the importance you assign them. Also, is it *really* necessary to post uninformative comments on a question that's over a year old? – C. A. McCann Jun 16 '12 at 23:56
- @C.A.McCann "You also severely overestimate the importance of hash tables, and have completely failed to justify the importance you assign them". Hash tables are often over 10× faster than any purely functional set or dictionary. Consequently, the performance of most purely functional graph algorithms is often abysmal in practice, and that is often unacceptable. – J D Jun 17 '12 at 11:35
- @C.A.McCann "Your logic is flawed and your assumptions about what's useful in practice are misguided. Please educate yourself further instead of wasting my time". Well, I've benchmarked red-black trees, AVL trees, weight-balanced trees, splay trees, hash tries, and open and closed hash tables with numerous different key and value types in several different languages (including my own language on my own VM with my own GC), and I've influenced dictionaries and GCs in OCaml, F#, .NET, Haskell, and Mathematica, and consulted for many companies on the subject (graphics, technical computing, financial). – J D Jun 17 '12 at 19:01
- @JonHarrop: Those are narrow, largely academic niches with very minor impact on most real-world software development; the performance issues you raise are, 99% of the time, unnecessary premature optimization. I can't help but note that your puffed-up list of accomplishments doesn't mention *actually writing software that people use*... – C. A. McCann Jun 17 '12 at 20:57
- Unless @Jon asks to have the rude comment deleted, I'm of the opinion to let it stand. It's kind of funny, imho. – Jun 18 '12 at 03:10
- @C.A.McCann "largely academic niches". These are all commercial code bases in industry doing completely different things that I have worked on. The graphics apps use dictionaries to represent subdivision curves and surfaces (e.g. PS, PDF, WPF), where they are perf-critical. Mathematica's core is a global rewrite table represented as a hash table, and it is performance-critical (they paid me to do a comparison with purely functional dictionaries). In finance, dictionaries are used to represent an in-memory database associating desks, traders, bids/offers, trades, and so on, both client- and server-side. – J D Jun 18 '12 at 05:48
- @delnan Yes, but my question was asking specifically about hash tables. I was curious how hash tables in particular are implemented in functional languages, since they are (usually) the fastest associative-array implementation in the general case. – Matt Fichman Dec 06 '13 at 18:06
3 Answers
> Is there any way to implement hash tables efficiently in a purely functional language?
Hash tables are a concrete implementation of the abstract "dictionary" or "associative array" data structure. So I think you really want to ask about the efficiency of purely functional dictionaries compared to imperative hash tables.
> It seems like any change to the hash table would require creating a copy of the original hash table.
Yes, hash tables are inherently imperative and there is no direct purely functional equivalent. Perhaps the most similar purely functional dictionary type is the hash trie, but hash tries are significantly slower than hash tables due to extra allocations and pointer indirections.
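To illustrate the persistent-update behaviour in question, here is a minimal sketch (my example, not part of the original answer) using `Data.HashMap.Strict` from Haskell's `unordered-containers` package, a hash array mapped trie comparable to Clojure's `PersistentHashMap`:

```haskell
import qualified Data.HashMap.Strict as HM  -- unordered-containers (a HAMT)

main :: IO ()
main = do
  let m0 = HM.fromList [("a", 1), ("b", 2)] :: HM.HashMap String Int
      m1 = HM.insert "c" 3 m0   -- copies only the path to the change; m0 is untouched
  print (HM.lookup "c" m0)      -- Nothing: the old version persists
  print (HM.lookup "c" m1)      -- Just 3
```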
> I must be missing something. Hash tables are pretty darn important data structures, and a programming language would be limited without them.
Dictionaries are a very important data structure (although it's worth noting that they were rare in the mainstream until Perl made them popular in the 1990s, so people coded things for decades without the benefit of dictionaries). I agree that hash tables are also important because they are often by far the most efficient dictionaries.
There are many purely functional dictionaries:
- Balanced trees (red-black, AVL, weight-balanced, finger trees, etc.), e.g. `Map` in OCaml and F# and `Data.Map` in Haskell (see the sketch after this list).
- Hash tries, e.g. `PersistentHashMap` in Clojure.
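As a small taste of the balanced-tree family, a sketch assuming Haskell's `Data.Map` (the `Map` modules in OCaml and F# look much the same); note the ordered, O(log n) interface that distinguishes these from hash-based structures:

```haskell
import qualified Data.Map as Map  -- a size-balanced binary search tree

main :: IO ()
main = do
  let m = Map.fromList (zip [1 .. 5 :: Int] "abcde")
  print (Map.lookup 3 m)                       -- Just 'c', found in O(log n)
  print (Map.toAscList (Map.insert 6 'f' m))   -- insert returns a new map, keys in order
```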
But these purely functional dictionaries are all much slower than a decent hash table (e.g. the .NET `Dictionary`).
Beware Haskell benchmarks comparing hash tables to purely functional dictionaries and claiming that purely functional dictionaries are competitively performant. The correct conclusion is that Haskell's hash tables are so inefficient that they are almost as slow as purely functional dictionaries. If you compare with .NET, for example, you find that a .NET `Dictionary` can be 26× faster than Haskell's hash table!
One commenter (ScottWest, quoted in full in the comments below) argued:

> I think to really conclude what you're trying to conclude about Haskell's performance you would need to test more operations, use a non-ridiculous key type (doubles as keys, what?), not use `-N8` for no reason, and compare to a third language that also boxes its parametric types, like Java (as Java has acceptable performance in most cases), to see if it's a common problem of boxing or some more serious fault of the GHC runtime. These [benchmarks](http://gregorycollins.net/posts/2011/06/11/announcing-hashtables) are along these lines (and ~2× as fast as the current hashtable implementation).
This is exactly the kind of misinformation I was referring to. Pay no attention to Haskell's hash tables in this context; just look at the performance of the fastest hash tables (i.e., not Haskell's) and the fastest purely functional dictionaries.

- I think to really conclude what you're trying to conclude about Haskell's performance you would need to test more operations, use a non-ridiculous key type (doubles as keys, what?), not use `-N8` for no reason, and compare to a third language that also boxes its parametric types, like Java (as Java has acceptable performance in most cases), to see if it's a common problem of boxing or some more serious fault of the GHC runtime. These [benchmarks](http://gregorycollins.net/posts/2011/06/11/announcing-hashtables) are along these lines (and ~2× as fast as the current hashtable implementation). – ScottWest Jun 05 '12 at 14:21
- @ScottWest My last paragraph already explained why such slightly-less-slow-hash-table-in-Haskell benchmarks are irrelevant. This is about fast hash tables vs. fast purely functional dictionaries. – J D Jun 05 '12 at 15:03
- I agree completely with @JonHarrop: purely functional data structures just can't copy hash tables, and you can't fake them. I think this really deserves to be the accepted answer. – Kristopher Micinski Jun 05 '12 at 15:22
- Misinformation? Your (now) second-to-last paragraph is basically a non sequitur that only aims to make an off-topic snipe at Haskell performance. Trim that and I would gladly upvote the answer. – ScottWest Jun 05 '12 at 19:19
- "burns all eight cores" – that sounds a little suspicious; GHC won't even use more cores unless you compile with `-threaded`, and won't use them unless you tell it to at runtime. There may be a bad case in there somewhere, but it seems you have to go looking for it, i.e. running with `+RTS -N8`. – ScottWest Jun 07 '12 at 07:55
- @JonHarrop Well, I'm just saying that to get all those cores burnt you have to strive for it; you did use `-N8`, correct? – ScottWest Jun 08 '12 at 03:16
- @JonHarrop Is there a better way to mark part of an array during GC other than scanning it (linear time)? Put a tree on top to narrow down which elements were changed (it seems this is what card marking is: dividing the array up into pieces)? – ScottWest Jun 08 '12 at 03:18
- @JonHarrop I agree with using the multithreaded runtime (`-threaded`), as it enables comparing like with like, but if `-N8` gives worse performance, why use it? You could probably offset its effect by using `-qg1` to run the parallel collector only in generations >= 1. Btw, thanks for the pointers on array GC. – ScottWest Jun 08 '12 at 10:24
- It's true that GHC has a dimension that other runtimes generally do not: "how many cores would you like to use?". You have to resolve the difference somehow, but I think you resolved it by telling GHC to use 8 cores for parallel collection, while the CLR had only one thread of execution, so it didn't have to worry about it (I assume the mutators collect themselves, but I'm not sure). As I said, you can keep `-N8` and then disable parallel GC for gen 0 with `-qg1`; the GHC documentation suggests that this may help. – ScottWest Jun 10 '12 at 16:16
- Yeah, that's been known to happen[.](https://meta.stackexchange.com/questions/164299/only-the-qualified-should-vote) – Will Ness Jan 18 '19 at 21:13
Hash tables can be implemented with something like the `ST` monad in Haskell, which wraps local mutable state in a purely functional interface. It does so by forcing the mutations to be performed sequentially and by preventing any reference to the mutable table from escaping, so it maintains referential transparency: you can't access an old "version" of the hash table.
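A minimal sketch of this approach, using `Data.HashTable.ST.Basic` from the `hashtables` package mentioned in the benchmark answer below (the function and names here are illustrative, not from the answer):

```haskell
import Control.Monad.ST (runST)
import qualified Data.HashTable.ST.Basic as H

-- A pure function whose implementation uses a mutable hash table internally.
sumLookups :: [(Int, Int)] -> [Int] -> Int
sumLookups pairs keys = runST $ do
  ht <- H.new                          -- fresh mutable table, local to this ST computation
  mapM_ (\(k, v) -> H.insert ht k v) pairs
  results <- mapM (H.lookup ht) keys   -- :: ST s [Maybe Int]
  return (sum [v | Just v <- results])

main :: IO ()
main = print (sumLookups [(1, 10), (2, 20)] [1, 2, 3])  -- prints 30
```

Because `runST` universally quantifies the state thread, the type checker rejects any attempt to return the table itself; only the final pure result can escape.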

The existing answers all have good points to share, and I thought I would add one more piece of data to the equation: a performance comparison of a few different associative data structures. The test consists of sequentially inserting the elements and then looking them up and summing them. It isn't incredibly rigorous and shouldn't be taken as such; it's just an indication of what to expect.
First, in Java, using `HashMap`, the unsynchronized `Map` implementation:
```java
import java.util.Map;
import java.util.HashMap;

class HashTest {
    public static void main(String[] args) {
        Map<Integer, Integer> map = new HashMap<Integer, Integer>();
        int n = Integer.parseInt(args[0]);
        for (int i = 0; i < n; i++) {
            map.put(i, i);
        }
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += map.get(i);
        }
        System.out.println("" + sum);
    }
}
```
Then a Haskell implementation using the recent hash table work done by Gregory Collins (it's in the `hashtables` package). This can be either pure (through the `ST` monad) or impure through `IO`; I'm using the `IO` version here:
```haskell
{-# LANGUAGE ScopedTypeVariables, BangPatterns #-}
module Main where

import Control.Monad
import qualified Data.HashTable.IO as HashTable
import System.Environment

main :: IO ()
main = do
  n <- read `fmap` head `fmap` getArgs
  ht :: HashTable.BasicHashTable Int Int <- HashTable.new
  mapM_ (\v -> HashTable.insert ht v v) [0 .. n - 1]
  x <- foldM (\ !s i -> HashTable.lookup ht i >>=
                        maybe undefined (return . (s +)))
             (0 :: Int) [0 .. n - 1]
  print x
```
Lastly, one using the immutable `HashMap` implementation from Hackage (from the `hashmap` package):
```haskell
module Main where

import Data.List (foldl')
import qualified Data.HashMap as HashMap
import System.Environment

main :: IO ()
main = do
  n <- read `fmap` head `fmap` getArgs
  let hashmap = foldl' (\ht v -> HashMap.insert v v ht)
                       HashMap.empty [0 :: Int .. n - 1]
  let x = foldl' (\s i -> hashmap HashMap.! i + s) 0 [0 .. n - 1]
  print x
```
Examining the performance for n = 10,000,000, I find the total running times are as follows:
- Java HashMap -- 24.387s
- Haskell HashTable -- 7.705s, 41% time in GC
- Haskell HashMap -- 9.368s, 62% time in GC
Knocking it down to n = 1,000,000, we get:
- Java HashMap -- 0.700s
- Haskell HashTable -- 0.723s
- Haskell HashMap -- 0.789s
This is interesting for two reasons:
- The performance is generally pretty close (except where Java diverges above 1M entries)
- A huge amount of time is spent in garbage collection (killing Java in the case of n = 10,000,000).
This would seem to indicate that languages like Haskell and Java, which box the map's keys, see a big hit from this boxing. Languages that either do not need to box, or can unbox, the keys and values would likely see performance a few times better.
Clearly these implementations are not the fastest, but I would say that, using Java as a baseline, they are at least acceptable/usable for many purposes (though perhaps someone more familiar with Java wisdom could say whether `HashMap` is considered a reasonable baseline).
I would note that the Haskell `HashMap` takes up a lot of space compared to the `HashTable`.
The Haskell programs were compiled with GHC 7.0.3 and `-O2 -threaded`, and run with only the `+RTS -s` flag for runtime GC statistics. Java was compiled with OpenJDK 1.7.
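For reference, given those flags the build and run presumably looked something like `ghc -O2 -threaded Main.hs` followed by `./Main 1000000 +RTS -s` (the file name and argument here are my guess; only the flags are stated above).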
