
I have used Mathematica mostly as a mathematics workbench and for writing relatively small ad hoc programs. However, I am now designing a system that I intend to program in Mathematica. I need to store data in a tree, and to search and traverse that tree. Although I know how to implement a tree, I would prefer standard, tested code. I looked for packages providing basic data structures on the Mathematica users wiki and found none, although there is a small example in the Mathematica documentation.

Now to my question(s):

  1. Is there an (open source) package for data structures available somewhere?

  2. What approach have you used with regard to data structures? Gradually developing your own util package?

(Not a question, just a remark: maybe the lack of lots of available open-source packages is the reason why Mathematica doesn't have the momentum it deserves. A chicken-and-egg issue, I am afraid.)

nilo de roock
  • Not an answer to your question, but there is the [old talk by Daniel Lichtblau](http://library.wolfram.com/infocenter/Conferences/388/) that discusses data structures in Mathematica. – Simon May 23 '11 at 12:40
  • Interesting! Will read it, thanks. – nilo de roock May 23 '11 at 12:50
  • Also, for some traversal code you might look at http://demonstrations.wolfram.com/GraphSearchingBreadthFirstAndDepthFirst/ – Daniel Lichtblau May 23 '11 at 12:54
  • I very much agree with your remark on open source. I think one of the problems is that there is no automatic packaging system which would allow to easily and automatically use the work of others (like jars in Java) and the standards for writing packages are not strict enough. – Leonid Shifrin May 23 '11 at 12:56
  • @Simon @ndrook1 - You might be interested in this package on MathSource as well: http://library.wolfram.com/infocenter/MathSource/4378/ – telefunkenvf14 May 24 '11 at 22:03
  • @telefunk: Thanks, I didn't know about that one! @Mr.W Thanks! I had almost forgotten how fast you have to be to answer a question around here... – Simon May 24 '11 at 22:18

2 Answers


In Mathematica, most of what you do is based on expressions, and expressions naturally have a tree structure. For depth-first traversals (which are probably the most common), you can use functions like Scan, Map, Cases, etc. The difference with respect to more traditional languages is that there is no simple way to preserve the identity of an individual node in an expression tree, since Mathematica has no pointers. Also, because expressions are immutable, many operations that are idiomatic in Mathematica copy the entire expression even when you only need to modify it in a few places.
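
For instance (a minimal illustration; the expression and the heads f, g, h, k are made up):

expr = f[g[1, 2], h[3, k[4, 5]]];   (* any Mathematica expression is already a tree *)
Level[expr, {-1}]                    (* the leaves, depth-first: {1, 2, 3, 4, 5} *)
Scan[Print, expr, {2}]               (* visit the level-2 subexpressions: 1, 2, 3, k[4, 5] *)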

Using immutable Mathematica expressions as trees still has several advantages. One is that, because they are immutable, it is easy to understand what they store just by looking at them (state and behavior are not mixed). Another is that there are efficient and generic functions, such as Map, MapIndexed or Scan, that traverse them. For example, the visitor design pattern becomes invisible: it is just Map[f, tree, Infinity], built into the language. Also, built-in functions such as Cases, Replace, ReplaceAll, etc., allow one to write very concise and declarative code to destructure trees, find pieces of a tree with a certain syntax or satisfying some condition, and so on. Since trees are not restricted to being built from lists and can use arbitrary heads, such tree-processing code can be made very concise. Finally, the freedom to build any tree structure very easily makes it much simpler to experiment and prototype, in the spirit of exploratory and bottom-up programming, which shortens the development cycle and ultimately leads to better designs.
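
For example (a small sketch; the heads n and f are made up for the illustration):

tree = n[n[1, 2], n[3, n[4, 5]]];
Map[f, tree, Infinity]                         (* the "visitor": n[f[n[f[1], f[2]]], f[n[f[3], f[n[f[4], f[5]]]]]] *)
Cases[tree, n[_Integer, _Integer], Infinity]   (* subtrees whose children are both leaves: {n[1, 2], n[4, 5]} *)
tree /. n[a_Integer, b_Integer] :> a + b       (* collapse such subtrees: n[3, n[3, 9]] *)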

That said, you can certainly implement a "stateful" (mutable) tree data structure. The real reason it has not been done generally yet is, I suspect, the performance hit associated with building, modifying and traversing such a tree, since it undergoes the full symbolic evaluation process at every step (see this post for more details on that). For two examples of how a binary search tree can be used in a Mathematica context for fairly efficient code, see my posts here (generic symbolic setting) and here (in the context of Compiled code). For general ways to construct data structures idiomatically in Mathematica, I recommend the books of Roman Maeder: "Programming in Mathematica", "The Mathematica Programmer I & II", and especially "Computer Science with Mathematica". In the latter he gives a detailed discussion of how to implement a binary search tree in Mathematica. EDIT: As @Simon mentioned, the talk by @Daniel Lichtblau is also a great resource showing how to build data structures and make them efficient.
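
To give a flavor of the immutable approach (only a minimal sketch of my own, not Maeder's implementation), a binary search tree can be represented with nested inert expressions, rebuilding the path to the new leaf on every insertion:

ClearAll[insert, toList];
(* bst[left, value, right] and empty are inert heads; no state is involved *)
insert[empty, x_] := bst[empty, x, empty];
insert[bst[l_, v_, r_], x_] :=
  If[x < v, bst[insert[l, x], v, r], bst[l, v, insert[r, x]]];
toList[empty] = {};
toList[bst[l_, v_, r_]] := Join[toList[l], {v}, toList[r]];  (* in-order traversal *)

tree = Fold[insert, empty, {5, 2, 8, 1, 9, 3}];
toList[tree]   (* {1, 2, 3, 5, 8, 9} *)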

Regarding general ways to implement data structures in Mathematica that incorporate some state, here is a simple example extracted from my post in this MathGroup thread; it implements a "pair" data structure.

Unprotect[pair, setFirst, getFirst, setSecond, getSecond, new, delete];
ClearAll[pair, setFirst, getFirst, setSecond, getSecond, new, delete];
Module[{first, second},
  (* first and second hold the per-instance state, keyed by a unique tag *)
  first[_] := {};
  second[_] := {};
  (* the "constructor": every new pair gets its own unique tag *)
  pair /: new[pair[]] := pair[Unique[]];
  (* "methods" are attached to pair via UpValues, using Dot as method-call syntax *)
  pair /: pair[tag_].delete[] := (first[tag] =.; second[tag] =.);
  pair /: pair[tag_].setFirst[value_] := first[tag] = value;
  pair /: pair[tag_].getFirst[] := first[tag];
  pair /: pair[tag_].setSecond[value_] := second[tag] = value;
  pair /: pair[tag_].getSecond[] := second[tag];
  (* display pairs opaquely, via a hash of the tag *)
  Format[pair[x_Symbol]] := "pair[" <> ToString[Hash[x]] <> "]";
];
Protect[pair, setFirst, getFirst, setSecond, getSecond, new, delete];

Here is how you could use it:

pr = new[pair[]];
pr.setFirst[10];
pr.setSecond[20];
{pr.getFirst[], pr.getSecond[]}

{10, 20}

Creating a list of new pair objects:

pairs = Table[new[pair[]], {10}]

{"pair[430427975]", "pair[430428059]", "pair[430428060]", "pair[430428057]",
"pair[430428058]", "pair[430428063]", "pair[430428064]", "pair[430428061]", 
"pair[430428062]", "pair[430428051]"}

Setting the fields:

Module[{i},
 For[i = 1, i <= 10, i++,
  pairs[[i]].setFirst[10*i];
  pairs[[i]].setSecond[20*i];]]

Checking the fields:

#.getFirst[] & /@ pairs

{10, 20, 30, 40, 50, 60, 70, 80, 90, 100}

#.getSecond[] & /@ pairs

{20, 40, 60, 80, 100, 120, 140, 160, 180, 200} 

In the post I mentioned there is a more detailed discussion. One big problem for "objects" created in this way is that there is no automatic garbage collection for them, which may be one of the major reasons why OOP extensions implemented in top-level Mathematica itself never really took off.
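
Cleanup therefore has to be done by hand, for example with the delete method defined above (continuing the pr example):

pr.delete[]      (* explicitly release the state held for pr *)
pr.getFirst[]    (* the stored value is gone; this falls back to the default {} *)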

There are several OOP extensions for Mathematica, such as the classes.m package by Roman Maeder (the source is in his "The Mathematica Programmer" book), the commercial Objectica package, and several others. But until Mathematica itself provides efficient mechanisms (perhaps based on some kind of pointer or reference semantics) for building mutable data structures, if that ever happens, there will probably be a large performance hit associated with top-level implementations of such data structures. Also, since Mathematica takes immutability as one of its core ideas, it is not easy to make mutable data structures fit well with the other idioms of Mathematica programming.

EDIT

Here is a bare-bones stateful tree implementation along the lines of the example above:

Module[{children, value},
  (* per-node state: a list of child nodes and a stored value, keyed by a unique tag *)
  children[_] := {};
  value[_] := Null;
  node /: new[node[]] := node[Unique[]];
  node /: node[tag_].getChildren[] := children[tag];
  node /: node[tag_].addChild[child_node, index_] :=
        children[tag] = Insert[children[tag], child, index];
  node /: node[tag_].removeChild[index_] :=
        children[tag] = Delete[children[tag], index];
  node /: node[tag_].getChild[index_] := children[tag][[index]];
  node /: node[tag_].getValue[] := value[tag];
  node /: node[tag_].setValue[val_] := value[tag] = val;
];

Some examples of use:

In[68]:= root = new[node[]]

Out[68]= node[$7]

In[69]:= root.addChild[new[node[]], 1]

Out[69]= {node[$8]}

In[70]:= root.addChild[new[node[]], 2]

Out[70]= {node[$8], node[$9]}

In[71]:= root.getChild[1].addChild[new[node[]], 1]

Out[71]= {node[$10]}

In[72]:= root.getChild[1].getChild[1].setValue[10]

Out[72]= 10

In[73]:= root.getChild[1].getChild[1].getValue[]

Out[73]= 10

For one non-trivial example of using this mutable tree data structure, see this post of mine. It also contrasts this method with one that more heavily reuses Mathematica's native data structures and functions, and illustrates well the points discussed at the beginning of this post.
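
For instance, a simple depth-first collection of all values stored in such a tree could look like this (just a sketch built on the node methods above; collectValues is a made-up name):

ClearAll[collectValues];
collectValues[nd_node] :=
  Prepend[Flatten[Map[collectValues, nd.getChildren[]], 1], nd.getValue[]];

collectValues[root]   (* for the tree built above: {Null, Null, 10, Null} *)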

Leonid Shifrin
  • Thank you, Leonid, for this fast and extensive answer. I will try to get the literature you mentioned. I have played with Roman Maeder's OO package last year, but for now that is off-topic. Anyway: so far I have functions for NewTree, NewNode, AddAsChild, AddAsSibling and Traverse. A node is then a list with {Id, Object, Previous, Parent}. I'll go through the literature first. – nilo de roock May 23 '11 at 13:21
  • I added a bare-bones implementation of a stateful tree in the edit, along the lines of my example - maybe you will find this useful. – Leonid Shifrin May 23 '11 at 13:31
  • Thanks, I'll give it a go this evening. Will report back. – nilo de roock May 23 '11 at 13:35
  • +1 Nice ideas! A few questions: Is a hash collision a problem? Not functionally; it's also not likely, but there will be a user that will be very surprised, now or 1000 years in the future. And isn't the overloading of Dot dangerous? I see how it works here and it looks sound, but still... Syntax coloring: since you don't actually provide a formal definition of the methods they appear as undefined (blue) in the editor. – Sjoerd C. de Vries May 23 '11 at 15:37
  • @Sjoerd Regarding hashes, they are for presentation purposes only (formatting rules). The real tags that give identity to objects are unique, which is guaranteed by `Unique[]`. Overloading `Dot` should not be dangerous since it is done via `UpValues` and affects only expressions of the form `node[tag]`. As for highlighting, it is very important that the methods themselves *do not* have formal definitions (`DownValues` etc), so the blue color is actually a good sign - this is also why I protected them in the example. This also allows different types to have methods with the same names. – Leonid Shifrin May 23 '11 at 15:44
  • @Sjoerd Using hashes is probably indeed a bad idea (and I had that feeling from the start :)). In the tree example in my edit I don't use them, but use tags directly. The idea behind hashes was to make tags completely opaque, so that the user can not break encapsulation by engineering object names manually. But this is unnecessary - a good implementation must be hackable and open-ended. – Leonid Shifrin May 23 '11 at 15:49
  • @Leonid. I got a lot out of your answer and the subsequent remarks, so I'll accept / close the answer for now. (Let me know if that is too soon, as I am fairly new here.) I will come back on the subject though. ;-) Thanks again. – nilo de roock May 23 '11 at 20:03
  • @ndroock1 Thanks for the accept! You can accept an answer as early as you like. You can also check-mark another answer later, if you feel that that new answer is a better one. As for the subject, I think this is indeed a very important topic and perhaps at some point Mathematica should have better support for efficient mutable data structures. I am very much interested in this topic as well. – Leonid Shifrin May 23 '11 at 20:09
  • @ndroock1 And by the way, welcome to StackOverflow! IIRC, I suggested you to have look at it about a month ago on your blog, and I hope you will not be disappointed! – Leonid Shifrin May 23 '11 at 20:16
  • @Mr.Wizard Thanks. I did not expect so many votes. Perhaps, partly this is due to the expansion of our SO community - the number of followers doubled in the 4 months that I am here. – Leonid Shifrin May 24 '11 at 12:38
  • @Leonid Wizard & I did some advertising. Maybe that helped. – Sjoerd C. de Vries May 26 '11 at 16:00
  • @ndroock1 You may want to look at my solution for the question asked in this thread http://stackoverflow.com/questions/6138540/code-manipulation-via-interactive-tree-for-mathematica, for what I think is a good illustration of both how to work with the mutable tree data structure presented here, and how one can often write a much shorter code by reusing Mathematica native structures and operations. – Leonid Shifrin May 27 '11 at 16:07
  • @LeonidShifrin:"the performance hit associated with building, modifying and traversing such a tree, since it will undergo a full symbolic evaluation process at every step". Term rewriting is one of several factors that combine to make Mathematica orders of magnitude slower than conventional compiled languages in this context. Other factors include the data representation (which is optimized for computer algebra, not data structures) and garbage collection (Mathematica's naive reference counting is extremely inefficient). – J D Nov 27 '11 at 14:44
  • @Jon Harrop True, but I wouldn't know the relative importance of the other effects you mentioned off-hand. My intuitive feeling is that issues related to garbage collection and data representation would show up much more for really large expressions. Garbage collection itself should be reasonably fast given that you can not make circular data structures in Mathematica. For data representation, it is obviously based on arrays (rather than linked lists) of pointers, so is inefficient for certain operations, but there are well-known ways to avoid those inefficiencies. – Leonid Shifrin Nov 27 '11 at 14:58
  • @LeonidShifrin: As a rule of thumb, naive reference counting (as in Mathematica) is about 10x slower than a modern garbage collector. http://flyingfrogblog.blogspot.com/2011/01/boosts-sharedptr-up-to-10-slower-than.html – J D Nov 28 '11 at 21:01
  • @Jon Harrop Nice to know, but for this to become a major bottleneck, the computation time should be dominated by allocation / deallocation. This usually happens when lots of small chunks of memory are frequently allocated and released. This will perhaps indeed matter for mutable data structures. – Leonid Shifrin Nov 28 '11 at 21:40
  • @LeonidShifrin: "for this to become a major bottleneck, the computation time should be dominated by allocation / deallocation". Actually that is a common misconception. Garbage collectors usually also incur overhead when the topology of the heap is altered. For example, incrementing and decrementing reference counts when references are acquired and released; or the write barrier in a typical copying collector that is invoked when a pointer is written into the heap. That does not require either allocation or deallocation. – J D Nov 29 '11 at 00:49
  • @Jon Harrop Interesting. Can you recommend some good references on this? – Leonid Shifrin Nov 29 '11 at 06:14
  • @LeonidShifrin: This is a fascinating subject with many great research papers. Richard Jones just published a new seminal monograph "The Garbage Collection Handbook: The Art of Automatic Memory Management" that I highly recommend http://www.amazon.co.uk/Garbage-Collection-Handbook-Management-Algorithms/dp/1420082795/ref=cm_cr_pr_product_top – J D Nov 29 '11 at 11:33
  • I also found this paper about concurrent garbage collection to be very enlightening http://doc.cat-v.org/inferno/concurrent_gc/ – J D Nov 29 '11 at 11:43
  • For overviews, see also Wilson's "Uniprocessor Garbage Collection Techniques (1992)" http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.47.2438 and Bacon's "A unified theory of garbage collection (2004)" http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.143.6619 – J D Nov 29 '11 at 11:46
  • @Jon Harrop Thanks a lot for the references, Jon. I hope I will be able to set some time apart to study this. – Leonid Shifrin Nov 29 '11 at 14:04

I have used Mathematica mostly as a mathematics workbench and for writing relatively small ad hoc programs.

Mathematica really excels at this.

What approach have you used with regard to data structures? Gradually developing your own util package?

I avoid creating my own data structures in Mathematica because it cannot handle them efficiently. Specifically, general data structures tend to be 10-1,000× slower in Mathematica than elsewhere which greatly limits their practical usefulness. For example, Mathematica is 100× slower than F# at computing the range of depths in a red-black tree.

Logic programming with lists is one example where Mathematica is typically orders of magnitude slower than compiled languages. The following Mathematica program uses linked lists to solve the n-queens problem:

(* linked lists are represented as nested pairs {head, tail}, with {} as the empty list *)

safe[{x0_, y0_}][{x1_, y1_}] := 
 x0 != x1 && y0 != y1 && x0 - y0 != x1 - y1 && x0 + y0 != x1 + y1

(* keep the elements of a linked list that satisfy the predicate p *)
filter[_, {}] := {}
filter[p_, {h_, t_}] := If[p[h], {h, filter[p, t]}, filter[p, t]]

(* no candidates left: count a solution if all n queens are placed *)
search[n_, nqs_, qs_, {}, a_] := If[nqs == n, a + 1, a]
(* otherwise, either skip the next candidate square or place a queen on it *)
search[n_, nqs_, qs_, {q_, ps_}, a_] := 
 search[n, nqs, qs, ps, 
  search[n, nqs + 1, {q, qs}, filter[safe[q], ps], a]]

(* all n^2 candidate squares, as a linked list *)
ps[n_] := 
 Fold[{#2, #1} &, {}, Flatten[Table[{i, j}, {i, n}, {j, n}], 1]]

solve[n_] := search[n, 0, {}, ps[n], 0]
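
As a quick sanity check of the definitions above (the 8-queens problem has 92 distinct solutions), one would run something like:

AbsoluteTiming[solve[8]]   (* the second element should be 92; the timing quoted below was about 10.5 s *)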

Here is the equivalent F#:

let safe (x0, y0) (x1, y1) =
  x0<>x1 && y0<>y1 && x0-y0<>x1-y1 && x0+y0<>x1+y1

let rec filter f = function
  | [] -> []
  | x::xs -> if f x then x::filter f xs else filter f xs

let rec search n nqs qs ps a =
  match ps with
  | [] -> if nqs=n then a+1 else a
  | q::ps ->
      search n (nqs+1) (q::qs) (filter (safe q) ps) a
      |> search n nqs qs ps

let ps n =
  [ for i in 1..n do
      for j in 1..n do
        yield i, j ]

let solve n = search n 0 [] (ps n) 0

solve 8

Solving the 8-queens problem takes 10.5s with Mathematica and 0.07s with F#. So F# is 150× faster than Mathematica in this case.

The Stack Overflow question Mathematica "linked lists" and performance gives a more extreme example. Naive translation of that Mathematica code into F# gives an equivalent program that runs between 4,000 and 200,000× faster than the Mathematica version:

// build a 10,000-element linked list of random ints
let rand = System.Random()
let xs = List.init 10000 (fun _ -> rand.Next 100)
// time 100 evaluations of List.length over that list
Array.init 100 (fun _ ->
  let t = System.Diagnostics.Stopwatch.StartNew()
  ignore(List.length xs)
  t.Elapsed.TotalSeconds)

Specifically, Mathematica takes 0.156s to 16s to perform a single iteration whereas the F# takes 42µs to 86µs.

If I really want to stay in Mathematica then I shoehorn everything I'm doing into Mathematica's handful of built-in data structures, e.g. Dispatch.
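
For example (a small sketch of the Dispatch idiom; the rule set here is made up), a large list of rules can be turned into a hash-table-like lookup structure:

rules = Dispatch[Table[i -> i^2, {i, 100000}]];   (* builds an internal hash table over the rules *)
12345 /. rules                                    (* fast lookup: 152399025 *)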

J D
  • Occasionally I do a Project Euler exercise and when done I compare the (performance) results with many other languages and solutions. Mathematica does not rank as slow in that competition. For my purposes Mathematica is sufficient. And when it is not, it usually is so far out of range that no other language can handle it either. - If you want speed with Mathematica you can compile to C as an intermediate language and tune from there. – nilo de roock Nov 28 '11 at 12:22
  • @niloderoock: I am not familiar with Project Euler but, judging by this webpage http://en.wikibooks.org/wiki/Collection_of_Computer_Programs_on_Project_Euler with solutions to the first 26 problems in Mathematica and F#, it looks like the problems don't require any custom data structures, which was the subject of this Stack Overflow question. – J D Nov 28 '11 at 18:00
  • Regarding your latest update, which code from that page are you comparing? – Mr.Wizard Nov 28 '11 at 18:32
  • @Mr.Wizard - I haven't done Euler sums for months, so I don't remember exactly. A lot of players use Mma as their language. Mma ranks #2 after K in compactness of the code, btw. – nilo de roock Nov 28 '11 at 20:05
  • @nilo, I was addressing Jon Harrop. :-) – Mr.Wizard Nov 28 '11 at 20:08
  • @JonHarrop - Is a Java ArrayList custom or not? No, since it is part of the util package; yes, because it is entirely written in Java using Java primitive types. - Similar reasoning could be done for mma, but unfortunately mma is closed source, so it's difficult to draw the line. – nilo de roock Nov 28 '11 at 20:13
  • Jon, is it the `getTimes` code, but with `RandomInteger[100, 100000]`? Also, the 16s time is a bug for which a workaround is given, correct? F# is then 300X faster, which is still a heck of a difference on this operation. – Mr.Wizard Nov 28 '11 at 21:28
  • @Jon Harrop For the queens problem, I have an entirely top-level solution in Mathematica which runs in `0.15` seconds for size `8`, and is also very short. I agree though that straightforward solutions in M often are slow. But, M wasn't designed to win the language shootouts, and I like it for what it offers (workflow, productivity, ways to experiment with ideas, interactivity, etc, etc). When I need speed, I will take your F# code and link it to M via .Net Link :) – Leonid Shifrin Nov 28 '11 at 23:13
  • @Mr.Wizard: Err, no. I accidentally ran the original Mathematica with 10,000 but ran the F# with 100,000 so F# is actually another 10x faster than I had thought... – J D Nov 28 '11 at 23:40
  • @LeonidShifrin: Is your top-level solution using linked lists? The point of my answer is that it is a literal translation of a common data structure. I agree about language shootouts but Mathematica is really bad for custom data structures. Anything outside its standard library is grindingly slow. – J D Nov 28 '11 at 23:42
  • @Jon Harrop Ok, agree on custom data structures - really bad indeed - this has been also my feeling for a long time already. My solution is not using linked lists, and I don't dispute that you have a point. I am saying that one *can* find efficient solutions to many problems in M, it is just not always straightforward. – Leonid Shifrin Nov 28 '11 at 23:50
  • @LeonidShifrin: Stephen Wolfram's fascination with cellular automata drove WRI to develop some awesome ways to evaluate them efficiently in Mathematica. That is closely related to the idea of solving problems using completely different data structures. Perhaps there is some irony in striving to reduce computational complexity in order to study irreducible complexity. :-) – J D Nov 29 '11 at 01:16