64

How useful is the feature of having an atom data type in a programming language?

A few programming languages have the concept of an atom or symbol to represent a constant of sorts. There are a few differences among the languages I have come across (Lisp, Ruby and Erlang), but it seems to me that the general concept is the same. I am interested in programming language design, and I was wondering what value having an atom type provides in real life. Other languages such as Python, Java and C# seem to be doing quite well without it.

I have no real experience of Lisp or Ruby (I know the syntaxes, but haven't used either in a real project). I have used Erlang enough to be used to the concept there.

jub0bs
Muhammad Alkarouri

13 Answers

54

Atoms are literals, constants with their own name for value. What you see is what you get and don't expect more. The atom cat means "cat" and that's it. You can't play with it, you can't change it, you can't smash it to pieces; it's cat. Deal with it.

I compared atoms to constants having their name as their values. You may have worked with code that used constants before: as an example, let's say I have values for eye colors: BLUE -> 1, BROWN -> 2, GREEN -> 3, OTHER -> 4. You need to match the name of the constant to some underlying value. Atoms let you forget about the underlying values: my eye colors can simply be 'blue', 'brown', 'green' and 'other'. These colors can be used anywhere in any piece of code: the underlying values will never clash and it is impossible for such a constant to be undefined!

taken from http://learnyousomeerlang.com/starting-out-for-real#atoms

With this being said, atoms end up being a better semantic fit to describing data in your code in places other languages would be forced to use either strings, enums or defines. They're safer and friendlier to use for similar intended results.

I GIVE TERRIBLE ADVICE
38

A short example that shows how the ability to manipulate symbols leads to cleaner code: (Code is in Scheme, a dialect of Lisp).

(define men '(socrates plato aristotle))

(define (man? x) 
    ;; memq compares symbols by identity (eq?)
    (if (memq x men) #t #f))

(define (mortal? x) 
    (man? x))

;; test

> (mortal? 'socrates)
=> #t

You can write this program using character strings or integer constants. But the symbolic version has certain advantages. A symbol is guaranteed to be unique in the system. This makes comparing two symbols as fast as comparing two pointers. This is obviously faster than comparing two strings. Using integer constants allows people to write meaningless code like:

(define SOCRATES 1)
;; ...

(mortal? SOCRATES)
(mortal? -1) ;; ??
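
To make the "comparing two pointers" point a bit more concrete, here is a rough Python sketch. Python has no symbol type, so `sys.intern` on strings is used here as a stand-in, and the variable names are purely illustrative.

```python
import sys

# Build two equal strings at run time so that they are separate objects.
parts = ["soc", "rates"]
a = "".join(parts)
b = "".join(parts)

equal_by_value = (a == b)       # compares the characters
same_object_before = (a is b)   # compares identity; typically False in CPython

# Interning maps equal strings onto one canonical object,
# which is roughly what a symbol table does for symbols.
sa = sys.intern(a)
sb = sys.intern(b)
same_object_after = (sa is sb)  # identity check now succeeds
```

After interning, an identity check is enough to decide equality, which is the property the symbolic version relies on.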

Probably a detailed answer to this question could be found in the book Common Lisp: A Gentle Introduction to Symbolic Computation.

cobbal
Vijay Mathew
  • 1
    Upvote for Touretsky's book! It's one of my favorite Lisp texts. – yonkeltron Feb 02 '11 at 14:07
  • 1
    So a symbol is a global efficient constant with some sort of type checking, right? And thanks for the book. – Muhammad Alkarouri Feb 02 '11 at 16:05
  • 4
    Muhammad, an atom is a string constant same way as an integer value is. When you see 1 in the code, it simply means 1; if you see 1.3f, then it means 1.3f. Same way an atom foo means foo. – damg Feb 02 '11 at 17:51
  • 1
    In C# strings are also guaranteed to point to the same address if they have identical values. – Egor Pavlikhin Feb 06 '11 at 23:27
  • 1
    @HeavyWave, that is not strictly correct, there is no "guarantee" of string interning. String Interning is *possible*, but is not required. String that are stored directly in the executable *are* interned by default, but any time you call the string constructor, you are creating a new instance. – John Gietzen Aug 27 '11 at 13:30
  • @EgorPavlikhin do those strings point to same address in different processes? or on different network machines? no they don't. – keymone Dec 12 '11 at 15:09
14

Atoms (in Erlang, Prolog, etc.) or symbols (in Lisp, Ruby, etc.), hereafter simply called atoms, are very useful when you have a semantic value that has no natural underlying "native" representation. They fill the role of C-style enums like this:

enum days { MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY, SUNDAY };

The difference is that atoms don't typically have to be declared and they have NO underlying representation to worry about. The atom monday in Erlang or Prolog has the value of "the atom monday" and nothing more or less.

While it is true that you could get much of the same use out of string types as you would out of atoms, there are some advantages to the latter. First, because atoms are guaranteed to be unique (behind the scenes their string representations are converted into some form of easily-tested ID) it is far quicker to compare them than it is to compare equivalent strings. Second, they are indivisible. The atom monday cannot be tested to see if it ends in day for example. It is a pure, indivisible semantic unit. You have less conceptual overloading than you would in a string representation in other words.

You could also get much of the same benefit with C-style enumerations. The comparison speed in particular is, if anything, faster. But... it's an integer. And you can do weird things like have SATURDAY and SUNDAY translate to the same value:

enum days { SATURDAY, SUNDAY = 0, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY };

This means you can't trust different "symbols" (enumerations) to be different things, which makes reasoning about code a lot more difficult. Sending enumerated types through a wire protocol is also problematic because there's no way to distinguish between them and regular integers. Atoms do not have this problem. An atom is not an integer and will never look like one behind the scenes.
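
The same aliasing can be reproduced with Python's standard `enum.IntEnum`, which is integer-backed much like a C enum. This is only a sketch to illustrate the point; the shortened `Days` class is made up for the example.

```python
from enum import IntEnum

class Days(IntEnum):
    SATURDAY = 0
    SUNDAY = 0      # same value, so this becomes an alias of SATURDAY
    MONDAY = 1
    TUESDAY = 2

# Two different names, one underlying member.
aliased = Days.SUNDAY is Days.SATURDAY
alias_name = Days.SUNDAY.name

# Once sent as a number, a member cannot be told apart from a plain integer.
looks_like_int = (Days.MONDAY == 1)
```

Here `aliased` is true and `alias_name` is `"SATURDAY"`, so the two "different" days really are one thing, and `looks_like_int` shows why the receiving end of a wire protocol cannot tell the enumeration from an ordinary 1.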

JUST MY correct OPINION
  • 1
    +1 But don't forget, for example, [`erlang:atom_to_list/1`](http://www.erlang.org/doc/man/erlang.html#atom_to_list-1) and its opposite [`erlang:list_to_atom/1`](http://www.erlang.org/doc/man/erlang.html#list_to_atom-1). They allow you to convert between atoms and strings (lists). It's discouraged though :-) – YasirA Feb 02 '11 at 14:59
  • 1
    Yasir: But a conversion, by definition, means it's no longer an atom (or a list, depending on direction). – JUST MY correct OPINION Feb 02 '11 at 15:09
  • I was commenting your *"The atom monday cannot be tested to see if it ends in `day` for example."* part WRT Erlang. Also, you forgot to put `@` in front of my name, I wouldn't notice your comment :-) – YasirA Feb 02 '11 at 15:21
  • @Yasir Arsanukaev: I know what you were commenting on. I was pointing out that if you convert the atom to a list, you're not comparing part of an atom any longer. You're comparing a list (as a string). Just like I can compare if the bottom end of an integer is "1671" by converting to a string -- it's not comparing integers any longer. – JUST MY correct OPINION Feb 02 '11 at 15:49
13

As a C programmer I had a problem with understanding what Ruby symbols really are. I was enlightened after I saw how symbols are implemented in the source code.

Inside the Ruby interpreter there is a global hash table that maps strings to integers. All Ruby symbols are kept there. During the source-code parsing stage, the interpreter uses that hash table to convert all symbols to integers. Internally, all symbols are then treated as integers. This means that one symbol occupies only 4 bytes of memory and all comparisons are very fast.

So basically you can treat Ruby symbols as strings which are implemented in a very clever way. They look like strings but perform almost like integers.

When a new string is created in Ruby, a new C structure is allocated to hold that object. For two Ruby strings, there are two pointers to two different memory locations (which may contain the same string). However, a symbol is immediately converted to a C int. Therefore there is no way to distinguish two symbols as two different Ruby objects. This is a side effect of the implementation. Just keep this in mind when coding and that's all.
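
The idea of a global table that turns names into integers can be sketched in a few lines of Python. This is a toy model of the mechanism described above, not Ruby's actual code; `SymbolTable`, `intern` and `name_of` are names invented for the illustration.

```python
class SymbolTable:
    """Toy global table mapping symbol names to small integer IDs."""

    def __init__(self):
        self._ids = {}      # name -> ID
        self._names = []    # ID -> name

    def intern(self, name):
        """Return the ID for `name`, allocating a new one on first sight."""
        sym_id = self._ids.get(name)
        if sym_id is None:
            sym_id = len(self._names)
            self._ids[name] = sym_id
            self._names.append(name)
        return sym_id

    def name_of(self, sym_id):
        """Recover the printable name, e.g. for inspection or output."""
        return self._names[sym_id]


table = SymbolTable()
foo1 = table.intern("foo")
bar = table.intern("bar")
foo2 = table.intern("foo")   # same name, so the same ID comes back
```

After the lookup, `foo1` and `foo2` are the same integer, so comparing two symbols is a single integer comparison and there is nothing left to tell them apart as separate objects.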

Greg Dan
12

In Lisp symbol and atom are two different and unrelated concepts.

Usually in Lisp an ATOM is not a specific data type. It is shorthand for NOT CONS.

(defun atom (item)
  (not (consp item)))

Also the type ATOM is the same as the type (NOT CONS).

Anything that is not a cons cell is an atom in Common Lisp.

A SYMBOL is a specific datatype.

A symbol is an object with a name and identity. A symbol can be interned in a package. A symbol can have a value, a function and a property list.

CL-USER 49 > (describe 'FOO)

FOO is a SYMBOL
NAME          "FOO"
VALUE         #<unbound value>
FUNCTION      #<unbound function>
PLIST         NIL
PACKAGE       #<The COMMON-LISP-USER package, 91/256 internal, 0/4 external>

In Lisp source code the identifiers for variables, functions, classes and so on are written as symbols. When a Lisp s-expression is read by the reader, it creates new symbols if they are not yet known (that is, not available in the current package) or reuses an existing symbol (if it is available in the current package). If the Lisp reader reads a list like

(snow snow)

then it creates a list of two cons cells. The CAR of each cons cell points to the same symbol snow. There is only one symbol for it in the Lisp memory.

Also note that the plist (the property list) of a symbol can store additional meta information for a symbol. This could be the author, a source location, etc. The user can also use this feature in his/her programs.
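
For readers who don't know Lisp, the reader's behaviour can be approximated in Python. This is a loose sketch under my own assumptions: `Symbol`, `Package` and `read_list` are hypothetical helpers, and a real Lisp reader and package system do far more.

```python
class Symbol:
    """Toy symbol: a name, an identity, and a property list."""

    def __init__(self, name):
        self.name = name
        self.plist = {}

    def __repr__(self):
        return self.name


class Package:
    """Toy package: owns the name -> Symbol table used by the reader."""

    def __init__(self):
        self._symbols = {}

    def intern(self, name):
        # Reuse the existing symbol, or create it on first sight.
        if name not in self._symbols:
            self._symbols[name] = Symbol(name)
        return self._symbols[name]


def read_list(text, package):
    """Read a flat list such as "(snow snow)" into a list of symbols."""
    return [package.intern(token) for token in text.strip("()").split()]


pkg = Package()
snow_list = read_list("(snow snow)", pkg)
# Metadata attached through one reference is visible through the other.
snow_list[0].plist["source"] = "example.lisp"
```

Both elements of `snow_list` are the very same `Symbol` object, so the property stored through the first element is also found through the second.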

Rainer Joswig
  • 4
    All very interesting and true, but not answering the question. The question is talking about the "atom data type" which, given the OP's comment about knowing Erlang, would be referring to what Erlang calls an atom and what Lisp calls a symbol (as does Ruby if memory serves). The clue is contained in "A few programming languages have the concept of atom or symbol to represent a constant of sorts. There are a few differences among the languages I have come across (Lisp, Ruby and Erlang), but it seems to me that the general concept is the same." – JUST MY correct OPINION Feb 02 '11 at 14:23
  • 3
    @JUST MY correct OPINION: The OP was talking about 'Atom' in Lisp and Erlang. Also about Symbols in Ruby and Scheme. I explained that ATOM and Symbols are not related, so his question makes limited sense. I explained then the difference between ATOMs and Symbols in Lisp, and what is offered by Symbols. – Rainer Joswig Feb 02 '11 at 14:28
  • 1
    @JUST MY correct OPINION: Naming constants is only one use case for symbols in Lisp. Symbols are mostly used as identifiers for some concept (function, variable, class) with possibly added metadata. In Ruby a symbol is comparable to what Lisp calls a keyword symbol. But that has limited use. It has not the attributes a Lisp symbol has. A keyword symbol in Lisp evaluates always to itself and is in the keyword package. – Rainer Joswig Feb 02 '11 at 14:35
  • 2
    Thanks. I mixed up the terminology in Lisp. I was thinking of alphanumeric atoms, which are properly symbols in Lisp. While my question was about Erlang symbols, your answer was definitely useful in removing my confusion. – Muhammad Alkarouri Feb 02 '11 at 16:14
6

In Scheme (and other members of the Lisp family), symbols are not just useful, they are essential.

An interesting property of these languages is that they are homoiconic. A Scheme program or expression can itself be represented as a valid Scheme data structure.

An example might make this clearer (using Gauche Scheme):

> (define x 3)
x
> (define expr '(+ x 1))
expr
> expr
(+ x 1)
> (eval expr #t)
4

Here, expr is just a list, consisting of the symbol +, the symbol x, and the number 1. We can manipulate this list like any other, pass it around, etc. But we can also evaluate it, in which case it will be interpreted as code.

In order for this to work, Scheme needs to be able to distinguish between symbols and string literals. In the example above, x is a symbol. It cannot be replaced with a string literal without changing the meaning. If we take a list '(print x), where x is a symbol, and evaluate it, that means something different from '(print "x"), where "x" is a string.

The ability to represent Scheme expressions using Scheme data structures is not just a gimmick, by the way; reading expressions as data structures and transforming them in some way, is the basis of macros.
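
To show why the evaluator needs symbols to be distinct from strings, here is a very small Python sketch of an evaluator over nested lists. It is an illustration only, not how Scheme is implemented; `Symbol`, `evaluate` and the `identity` entry in the environment are assumptions made for the example.

```python
import operator


class Symbol(str):
    """A name to be looked up, as opposed to a string literal."""


def evaluate(expr, env):
    if isinstance(expr, Symbol):        # symbol: look up its value
        return env[expr]
    if isinstance(expr, list):          # list: apply head to evaluated args
        fn = evaluate(expr[0], env)
        args = [evaluate(arg, env) for arg in expr[1:]]
        return fn(*args)
    return expr                         # numbers and plain strings self-evaluate


env = {
    Symbol("+"): operator.add,
    Symbol("x"): 3,
    Symbol("identity"): lambda value: value,
}

expr = [Symbol("+"), Symbol("x"), 1]    # data that mirrors '(+ x 1)
result = evaluate(expr, env)

# The expression is ordinary data, so it can be rebuilt before evaluation.
modified = expr[:2] + [10]
modified_result = evaluate(modified, env)

# A symbol is looked up, a string literal is returned as-is.
as_symbol = evaluate([Symbol("identity"), Symbol("x")], env)
as_string = evaluate([Symbol("identity"), "x"], env)
```

With `x` bound to 3, `result` is 4 and `modified_result` is 13, while `as_symbol` is 3 and `as_string` is the string `"x"`. Without a separate symbol type the evaluator could not tell the last two cases apart.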

Hans Nowak
4

You're actually not right in saying Python has no analogue to atoms or symbols. It's not difficult to make objects that behave like atoms in Python. Just make, well, objects. Plain empty objects. Example:

>>> red = object()
>>> blue = object()
>>> c = blue
>>> c == red
False
>>> c == blue
True
>>> 

TADA! Atoms in Python! I use this trick all the time. Actually, you can go further than that. You can give these objects a type:

>>> class Colour:
...  pass
... 
>>> red = Colour()
>>> blue = Colour()
>>> c = blue
>>> c == red
False
>>> c == blue
True
>>> 

Now, your colours have a type, so you can do stuff like this:

>>> type(red) == Colour
True
>>> 

So, that's more or less equivalent in features to lispy symbols, what with their property lists.

enigmaticPhysicist
3

Atoms are guaranteed to be unique and integral, in contrast to, e.g., floating-point constant values, which can differ because of inaccuracy introduced while encoding them, sending them over the wire, decoding them on the other side and converting them back to floating point. No matter which version of the interpreter you're using, an atom always has the same "value" and is unique.

The Erlang VM stores all the atoms defined in all the modules in a global atom table.

There's no Boolean data type in Erlang. Instead the atoms true and false are used to denote Boolean values. This prevents nasty tricks of this kind:

#define TRUE FALSE //Happy debugging suckers

In Erlang, you can save atoms to files, read them back, pass them over the wire between remote Erlang VMs etc.

As an example, I'll save a couple of terms into a file, and then read them back. This is the Erlang source file lib_misc.erl (or its most interesting part for us now):

-module(lib_misc).
-export([unconsult/2, consult/1]).

unconsult(File, L) ->
    {ok, S} = file:open(File, write),
    lists:foreach(fun(X) -> io:format(S, "~p.~n",[X]) end, L),
    file:close(S).

consult(File) ->
    case file:open(File, read) of
    {ok, S} ->
        Val = consult1(S),
        file:close(S),
        {ok, Val};
    {error, Why} ->
        {error, Why}
    end.

consult1(S) ->
    case io:read(S, '') of
    {ok, Term} -> [Term|consult1(S)];
    eof        -> [];
    Error      -> Error
    end.

Now I'll compile this module and save some terms to a file:

1> c(lib_misc).
{ok,lib_misc}
2> lib_misc:unconsult("./erlang.terms", [42, "moo", erlang_atom]).
ok
3>

In the file erlang.terms we'll get these contents:

42.
"moo".
erlang_atom. 

Now let's read it back:

3> {ok, [_, _, SomeAtom]} = lib_misc:consult("./erlang.terms").   
{ok,[42,"moo",erlang_atom]}
4> is_atom(SomeAtom).
true
5>

You can see that the data is successfully read from the file and that the variable SomeAtom really holds the atom erlang_atom.
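
For comparison, here is a Python sketch of how a runtime could keep atoms distinct from strings and integers when storing them and get the same atom back on reading. The one-byte tags and the `Atom`, `encode` and `decode` names are made up for this illustration; this is not Erlang's actual on-disk or wire format.

```python
class Atom:
    """Toy atom: one shared instance per name, kept in a global table."""

    _table = {}

    def __new__(cls, name):
        existing = cls._table.get(name)
        if existing is None:
            existing = super().__new__(cls)
            existing.name = name
            cls._table[name] = existing
        return existing

    def __repr__(self):
        return self.name


def encode(term):
    """Encode an int, str or Atom as a tag byte plus a UTF-8 payload."""
    if isinstance(term, Atom):
        return b"A" + term.name.encode("utf-8")
    if isinstance(term, str):
        return b"S" + term.encode("utf-8")
    if isinstance(term, int):
        return b"I" + str(term).encode("utf-8")
    raise TypeError("unsupported term: %r" % (term,))


def decode(data):
    """Inverse of encode; atoms are looked up in (or added to) the table."""
    tag, payload = data[:1], data[1:].decode("utf-8")
    if tag == b"A":
        return Atom(payload)
    if tag == b"S":
        return payload
    if tag == b"I":
        return int(payload)
    raise ValueError("unknown tag: %r" % (tag,))


terms = [42, "moo", Atom("erlang_atom")]
round_trip = [decode(encode(term)) for term in terms]
```

The decoded third element is the same `Atom` object as the original, while `"moo"` comes back as an ordinary string, which mirrors what `is_atom(SomeAtom)` shows in the Erlang session.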


The contents of lib_misc.erl are excerpted from "Programming Erlang: Software for a Concurrent World" by Joe Armstrong, published by The Pragmatic Bookshelf. The rest of the source code is here.

YasirA
  • All I just said can be true for Erlang. Not sure about other languages, mentioned in the question. – YasirA Feb 02 '11 at 15:23
  • An aside: are they unique across Erlang VM invocations? Can I store an atom and read it later? – Muhammad Alkarouri Feb 02 '11 at 16:10
  • @Muhammad Alkarouri: All erlang terms are serializable to a binary format with functions such as `term_to_binary(Atom)`. A serialized atom in Erlang will have a specific tag at the beginning of the binary saying it is indeed an atom, and will then have a textual representation of itself within the binary value. When unpacking the atom (using functions like `binary_to_term(Bin)`), the VM looks it up into its current atom table. If it's there, it gets the existing unique ID. If it's not there, a new one is attributed. This allows for safe distribution and storage of atoms. – I GIVE TERRIBLE ADVICE Feb 02 '11 at 16:18
  • I think more interesting than the serialization/deserialization of the atoms is the options list accepted by `file:open/2`! You don't have to handle a bunch of constants or binary `OR` them or anything. Just give them as they are or as in a list and it'll work. Want to add an option? simply write the code for it. No need for defines and special cases. Equality testing works it fine. – I GIVE TERRIBLE ADVICE Feb 02 '11 at 16:32
  • I second @I GIVE TERRIBLE ADVICE, and there's a full [External Term Format](http://erlang.org/doc/apps/erts/erl_ext_dist.html) specification. There's also [BERT-RPC](http://bert-rpc.org) specification, which is being developed and used in production within the infrastructure of GitHub and play a part in serving nearly every page of the site. I've developed BERT and BERT-RPC client libraries for some Scheme implementations, and terms and atoms in particular are identical on either sides in spite they're being sent over the wire. – YasirA Feb 02 '11 at 16:33
3

In some languages, associative array literals have keys that behave like symbols.

In Python[1], a dictionary.

d = dict(foo=1, bar=2)

In Perl[2], a hash.

my %h = (foo => 1, bar => 2);

In JavaScript[3], an object.

var o = {foo: 1, bar: 2};

In these cases, foo and bar are like symbols, i.e., unquoted immutable strings.

[1] Proof:

x = dict(a=1)
y = dict(a=2)

(k1,) = x.keys()
(k2,) = y.keys()

assert id(k1) == id(k2)

[2] This is not quite true:

my %x = (a=>1);
my %y = (a=>2);

my ($k1) = keys %x;
my ($k2) = keys %y;

die unless \$k1 == \$k2; # dies

[3] In JSON, this syntax is not allowed because keys must be quoted. I don't know how to prove they are symbols because I don't know how to read the memory of a variable.

tantalor
2

Atoms are like an open enum, with infinite possible values, and no need to declare anything up front. That is how they're typically used in practice.

For example, in Erlang, a process is expecting to receive one of a handful of message types, and it's most convenient to label the message with an atom. Most other languages would use an enum for the message type, meaning that whenever I want to send a new type of message, I have to go add it to the declaration.

Also, unlike enums, sets of atom values can be combined. Suppose I want to monitor my Erlang process's status, and I have some standard status monitoring tool. I can extend my process to respond to the status message protocol as well as my other message types. With enums, how would I solve this problem?

enum my_messages {
  MSG_1,
  MSG_2,
  MSG_3
};

enum status_messages {
  STATUS_HEARTBEAT,
  STATUS_LOAD
};

The problem is MSG_1 is 0, and STATUS_HEARTBEAT is also 0. When I get a message of type 0, what is it? With atoms, I don't have this problem.
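
Here is a small Python sketch of the same situation, using interned strings as a stand-in for atoms. The `atom` helper, the tag names and the `handle` function are all hypothetical.

```python
import sys


def atom(name):
    """Stand-in for an atom: the canonical interned string for `name`."""
    return sys.intern(name)


MY_MESSAGES = {atom("msg_1"), atom("msg_2"), atom("msg_3")}
STATUS_MESSAGES = {atom("heartbeat"), atom("load")}

# Name-valued tags combine without colliding...
ACCEPTED = MY_MESSAGES | STATUS_MESSAGES
# ...whereas integer-valued enum members overlap when merged.
INT_TAGS = {0, 1, 2} | {0, 1}


def handle(message):
    """Dispatch on the tag, which is the first element of the message tuple."""
    tag = message[0]
    if tag in STATUS_MESSAGES:
        return "status"
    if tag in MY_MESSAGES:
        return "application"
    return "unknown"
```

`ACCEPTED` ends up with five distinct tags while `INT_TAGS` has only three, and `handle(("heartbeat",))` can be routed to the status protocol without ever being confused with `("msg_1", 42)`.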

Atoms/symbols are not just strings with constant-time comparison :).

Cosmin
Sean
2

In Ruby, symbols are often used as keys in hashes, so often that Ruby 1.9 even introduced a shorthand for constructing a hash. What you previously wrote as:

{:color => :blue, :age => 32}

can now be written as:

{color: :blue, age: 32}

Essentially, they are something between strings and integers: in source code they resemble strings, but with considerable differences. Two identical strings are in fact different instances, while two identical symbols are always the same instance:

> 'foo'.object_id
# => 82447904 
> 'foo'.object_id
# => 82432826 
> :foo.object_id
# => 276648 
> :foo.object_id
# => 276648 

This has consequences for both performance and memory consumption. Also, they are immutable: they are not meant to be altered once assigned.

An arguable rule of thumb would be to use symbols instead of strings for every string not meant for output.

Although perhaps seeming irrelevant, most code-highlighting editors colour symbols differently than the rest of the code, making the visual distinction.

Mladen Jablanović
2

The problem I have with similar concepts in other languages (e.g., C) can be easily expressed as:

#define RED 1
#define BLUE 2

#define BIG 1
#define SMALL 2

or

enum colors { RED, BLUE  };
enum sizes  { BIG, SMALL };

Which causes problems such as:

if (RED == BIG)
    printf("True");
if (BLUE == 2)
    printf("True");

Neither of which really makes sense. Atoms solve a similar problem without the drawbacks noted above.

ktr
1

Atoms provide fast equality testing, since they use identity. Compared to enumerated types or integers, they have better semantics (why would you represent an abstract symbolic value by a number anyway?) and they are not restricted to a fixed set of values like enums.

The compromise is that they are more expensive to create than literal strings, since the system needs to know all existing instances to maintain uniqueness; this costs time mostly for the compiler, but it costs memory in O(number of unique atoms).

Damien Pollet
  • 1
    In Lisp the symbols don't cost much for the compiler, since the lookup is done already by the 'reader'. – Rainer Joswig Feb 02 '11 at 14:00
  • 2
    `O(NumberOfAtoms)` is not necessarily right -- All you need is to have a sane unique id generation scheme (Erlang uses references, which are incrementing values bound to the VM's lifetime) making new atoms is mostly a free operation that needs not to be considered. In the case of Erlang, atoms are not GC'ed though, so it's usually a bad idea to generate them dynamically anyway. – I GIVE TERRIBLE ADVICE Feb 02 '11 at 14:09
  • Wouldn't you be using O(NumberOfUniqueStrings) in a string-based alternative to atoms/symbols? And I'd guess that it's more O(1) than O(n) since, as I GIVE TERRIBLE ADVICE noted, you just need a sane ID generation system. – JUST MY correct OPINION Feb 02 '11 at 14:17
  • 1
    Having re-read the comment better, in Erlang's case, you do need `O(LengthOfAllStrings+NUniqueIDs)` in terms of storage. However, each active use of the atom in the code doesn't require to know the string itself and only the ID can be used. Different implementations (i.e. Prolog) will have garbage collection of atoms, and you can bet that depending on the actual application, different tradeoffs will be done: using the same string 5000 times vs. using 5 atoms a thousand times give different memory usage results – I GIVE TERRIBLE ADVICE Feb 02 '11 at 14:23
  • I was thinking of Smalltalk symbols, where the system has a collection of all instances but ensures it reuses those instead of allocating a new one with the same name. Also that is compatible with garbage collection because the system-wide set of symbols would typically use weak references. // PS. What do you mean by "a sane ID generation system"? (In Smalltalk a Symbol is a kind of String and the ID is its identity, e.g. its pointer) – Damien Pollet Feb 02 '11 at 20:09