2

I understand one aspect of why Symbols should be used as opposed to Strings in Hashes. Namely that there is only one instance of a given Symbol in memory whereas there could be multiple instances of a given String with the same value.

What I don't understand is how Symbols are faster than Strings in a Hash lookup. I've looked at the answers here, but I still don't quite get it.

If :foo.hash == :foo.object_id returned true, then it would've made some sense because then it'd be able to use the object id as the value for the hash and wouldn't have to compute it every time. However this isn't the case and :foo.object_id is not equal to :foo.hash. Hence my confusion.

Asad Moosvi
  • 487
  • 5
  • 18

3 Answers3

7

There's no obligation for hash to be equivalent to object_id. Those two things serve entirely different purposes. The point of hash is to be as deterministic and yet random as possible so that the values you're inserting into your hash are evenly distributed. The point of object_id is to define a unique object identifier, though there's no requirement that these be random or evenly distributed. In fact, randomizing them is counter-productive, that'd just make things slower for no reason.

The reason symbols tend to be faster is because the memory for them is allocated once (garbage collection issues aside) and recycled for all instances of the same symbol. Strings are not like that. They can be constructed in a multitude of ways, and even two strings that are byte-for-byte identical are likely to be different objects. In fact, it's safer to presume they are than otherwise unless you know for certain they're the same object.

Now when it comes to computing hash, the value must be randomly different even if the string changes very little. Since the symbol can't change computing it can be optimized more. You could just compute a hash of the object_id since that won't change, for example, while the string needs to take into account the content of itself, which is presumably dynamic.

Try benchmarking things:

require 'benchmark'

count = 100000000

Benchmark.bm do |bm|
  bm.report('Symbol:') do
    count.times { :symbol.hash }
  end
  bm.report('String:') do
    count.times { "string".hash }
  end
end

This gives me results like this:

       user     system      total        real
Symbol:  6.340000   0.020000   6.360000 (  6.420563)
String: 11.380000   0.040000  11.420000 ( 11.454172)

Which in this most trivial case is easily 2x faster. Based on some basic testing the performance of the string code degrades O(N) as the strings get longer but the symbol times remain constant.

tadman
  • 208,517
  • 23
  • 234
  • 262
  • So `String#hash` takes longer to run on average than `Symbol#hash` or `Fixnum#hash` because it's performing more computation to calculate the hash due to its dynamic nature. Right? – Asad Moosvi Jul 16 '17 at 20:19
  • The hash of the string is computed based on the entire contents of the string, and as the contents can change the hash can change as well. The hash of a "singleton" like Symbol should be trivial by comparison. I'm not sure about Fixnum/Integer, but you could try benchmarking that as well if you're curious. Should be similar. – tadman Jul 16 '17 at 20:23
  • Yup, `Fixnum#hash` is fairly similar to `Symbol#hash`. – Asad Moosvi Jul 16 '17 at 20:24
  • 1
    Using symbol-based keys is something you should aspire to do *unless* you're dealing with completely arbitrary data. Symbols are a one-time cost in terms of memory, but that cost can accumulate if you use them for everything. Imagine "Symbol" as being like "Cached (internalized) string". You'd only want to do that with things that are used fairly frequently, like `{ name: "Bob", age: 21 }`, method arguments, and so on. – tadman Jul 16 '17 at 20:26
  • 1
    Very interesting and informative (and well-presented). A great answer! – Cary Swoveland Jul 17 '17 at 05:06
1

Just want to add that I do not entirely agree with the numbers that @tadman came up with. On my testing it is at most 1.5 faster to use calcualte '#hash'. I used benchmark/ips to test performance.

require 'benchmark/ips'

Benchmark.ips do |bm|
  bm.compare!
  bm.report('Symbol:') do
    :symbol.hash
  end
  bm.report('String:') do
    'string'.hash
  end
end

And this results in

Comparison:
             Symbol:: 10741305.8 i/s
             String::  7051559.3 i/s - 1.52x slower

Also if you enable 'frozen string literals' (which will be default in future ruby verions) the difference drops to factor of 1.2:

# frozen_string_literal: true

Comparison:
             Symbol::  9014176.3 i/s
             String::  7532196.9 i/s - 1.20x slower
Pascal
  • 8,464
  • 1
  • 20
  • 31
0

An additional overhead for strings as hash keys is that since strings are mutable, and also commonly used has hash keys, the Hash class makes a copy of all string keys (likely with a method like dup or clone) in order to protect the integrity of the hash from key damage.

Consider:

irb(main):001:0> a = {}
=> {}
irb(main):002:0> b = "fred"
=> "fred"
irb(main):003:0> a[b] = 42
=> 42
irb(main):004:0> a
=> {"fred"=>42}
irb(main):005:0> b << " flintstone"
=> "fred flintstone"
irb(main):006:0> a
=> {"fred"=>42}
irb(main):007:0> b
=> "fred flintstone"
irb(main):008:0>
irb(main):008:0> b.object_id
=> 17350536
irb(main):009:0> a.keys[0].object_id
=> 15113052
irb(main):010:0>

Symbols are immutable and need no such drastic measures.

Peter Camilleri
  • 1,882
  • 15
  • 17
  • I'm not sure what you want to prove with your code. The key isn't the variable `b` but the string `"fred"`. You can modify `b` afterwards, `a` doesn't care anymore. You're right though, the key gets duplicated and frozen. see this [thread](https://stackoverflow.com/questions/13044839/why-is-a-string-key-for-a-hash-frozen) – Eric Duminil Jul 16 '17 at 21:20
  • That is pretty much my point. Changes to the original string do not reflect in the hash key because it was duplicated (or cloned?). Symbols do not need this treatment which speaks to the original point of the question about speed advantages. Maybe I should have shown how the keys were frozen? However, I am led to believe that freezing is not as time intensive as copying. – Peter Camilleri Jul 16 '17 at 21:29