I have a dream to improve the world of distributed programming :)

In particular, I feel a lack of the tools needed for debugging, monitoring, understanding and visualizing the behavior of distributed systems (heck, I had to write my own logger and visualizers to satisfy my requirements), and I'm writing a couple of such tools in my free time.

Community, what tools do you lack in this regard? Please describe one per answer, with a rough idea of what the tool should do. Others can point out existing tools that fit, or someone might get inspired and write them.

jkff
  • CW has been removed as a shield against crappy questions but I'm of the opinion this one's okay, falling under the FAQ clause 4: "matters that are unique to the programming profession". Others may disagree. It's certainly a more _useful_ poll question than "what's your favourite programming-related cartoon?" :-) – paxdiablo Nov 23 '10 at 08:43
  • Reminds me of a discussion a few years ago with a researcher who described MPI as "the assembly of parallel programming" and noted the need for additional tools. This is certainly a useful technical question and it belongs here. I hope that it won't be flagged as community wiki, and I would even advise editing the question to remove the suggestion that it is subjective or deserves CW status. It is certainly popular. – Muhammad Alkarouri Nov 23 '10 at 11:05

10 Answers

13

OK, let me start.

A distributed logger with a high-precision global time axis: one that registers events from different machines in a distributed system with high precision, independent of clock offset and drift, and that scales to the load of several hundred machines and several thousand logging processes. Such a logger lets you find transport-level latency bottlenecks in a distributed system by seeing, for example, how many milliseconds it actually takes for a message to travel from the publisher to the subscriber through a message queue.

Syslog is not ok because it's not scalable enough - 50000 logging events per second will be too much for it, and timestamp precision will suffer greatly under such load.

Facebook's Scribe is not ok because it doesn't provide a global time axis.

Actually, both syslog and Scribe register events under arrival timestamps, not under occurrence timestamps.
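To make the "occurrence timestamp on a global time axis" idea concrete, here is a minimal sketch of one common way to do it: an NTP-style offset estimate from request/reply probes, applied to client-side timestamps. This is an illustration under stated assumptions, not a description of how any particular logger works; `Probe`, `best_offset` and `to_global` are hypothetical names, and the estimate assumes roughly symmetric network latency.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Probe:
    """One request/reply exchange used to calibrate a client clock (hypothetical type)."""
    t0: float  # client clock: request sent
    t1: float  # server clock: request received
    t2: float  # server clock: reply sent
    t3: float  # client clock: reply received

    @property
    def offset(self) -> float:
        # Estimated (server clock - client clock), assuming symmetric latency.
        return ((self.t1 - self.t0) + (self.t2 - self.t3)) / 2

    @property
    def round_trip(self) -> float:
        # Time spent on the network, excluding server processing time.
        return (self.t3 - self.t0) - (self.t2 - self.t1)


def best_offset(probes: Iterable[Probe]) -> float:
    """Trust the probe with the shortest round trip: its error bound is smallest."""
    return min(probes, key=lambda p: p.round_trip).offset


def to_global(client_timestamp: float, offset: float) -> float:
    """Map an occurrence timestamp taken on the client onto the server's time axis."""
    return client_timestamp + offset


# Example: the client clock runs 5 s behind the server.
probes = [
    Probe(t0=100.0, t1=105.25, t2=105.5, t3=100.75),  # symmetric latency
    Probe(t0=200.0, t1=206.0, t2=206.0, t3=201.5),    # asymmetric, slower path
]
offset = best_offset(probes)
event_time = to_global(100.5, offset)
```

The error of such an estimate is bounded by half the round-trip time, which is why the sketch keeps the fastest probe. Drift is not handled here; a real tool would have to recalibrate periodically.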

Honestly, I don't lack such a tool - I've written one for myself, I'm greatly pleased with it and I'm going to open-source it. But others might.

P.S. I've open-sourced it: http://code.google.com/p/greg

jkff
  • I agree. I've written a few one-off programs to temporally merge logs from multiple distributed systems before. It was a great help in diagnosing an issue we were having. Standard tools to make this kind of diagnostics easier would be welcome. And yes, I want to know the exact time when the event occurred, not the time when the system got around to noticing and persisting the event. – Mike Clark Nov 23 '10 at 08:55
  • How did you do it? Doesn't your tool violate the laws of physics? :) According to special relativity there's no such thing as a global absolute time. – Hongli Nov 23 '10 at 10:32
  • Its precision is finite but sufficiently high for my purposes on a speedy LAN (precision ~= asymmetry of network latencies, i.e. microseconds or at most milliseconds). It measures events client-side and calibrates clock offset with clients. – jkff Nov 23 '10 at 10:43
  • I'm studying tools like this for my M.Sc. thesis. Very challenging problem to get right. – Tony Arkles Nov 24 '10 at 01:38
  • @Tony Arkles: Could you elaborate on your thesis and on the challenges? Maybe in a blog post? :) – jkff Nov 24 '10 at 05:24
  • @jkff: I asked you a question about your logging, specifically the global time axis, in your original logging thread. Please answer with as many details as you can when you have a minute! :) – please delete me Nov 24 '10 at 07:03
  • @Immilewski - you can have it, I've open-sourced it, as I mentioned in my last comment. Please contact me at ekirpichov@gmail.com if it lacks something that you need. – jkff Mar 21 '11 at 22:49
9

Dear Santa, I would like visualizations of the interactions between components in the distributed system.

I would like a visual representation showing:

  • The interactions among components, either as a UML collaboration diagram or sequence diagram.
  • Component shutdown and startup times as self-interactions.
  • On which hosts components are currently running.
  • Location of those hosts, if available, within a building or geographically.
  • Host shutdown and startup times.

I would like to be able to:

  • Filter the components and/or interactions displayed to show only those of interest.
  • Record interactions.
  • Display a desired range of time in a static diagram.
  • Play back the interactions in an animation, with typical video controls for playing, pausing, rewinding, fast-forwarding.
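As a rough sketch of the recording and filtering part of this wish list: interactions can be kept as plain timestamped records, filtered by time range and by component, and rendered as text lines that a sequence-diagram tool could consume. Everything here (`Interaction`, `select`, `render`, the component names) is hypothetical and only illustrates the data model, not an existing tool.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Interaction:
    timestamp: float  # seconds on whatever common time axis is available
    sender: str
    receiver: str     # equal to sender for self-interactions (startup, shutdown)
    message: str


def select(interactions: Iterable[Interaction], start: float, end: float,
           components: Optional[Set[str]] = None) -> List[Interaction]:
    """Keep interactions inside [start, end] that touch a component of interest."""
    chosen = [
        i for i in interactions
        if start <= i.timestamp <= end
        and (components is None
             or i.sender in components
             or i.receiver in components)
    ]
    return sorted(chosen, key=lambda i: i.timestamp)


def render(interactions: Iterable[Interaction]) -> List[str]:
    """One text line per interaction, in sequence-diagram order."""
    return [f"{i.timestamp:8.3f}  {i.sender} -> {i.receiver}: {i.message}"
            for i in interactions]


recorded = [
    Interaction(0.5, "worker", "worker", "startup"),
    Interaction(2.0, "worker", "store", "save result"),
    Interaction(1.5, "broker", "worker", "job"),
    Interaction(9.0, "broker", "audit", "heartbeat"),
]
window = select(recorded, start=1.0, end=5.0, components={"worker"})
lines = render(window)
```

Playback then amounts to walking the sorted list at a chosen speed; the static diagram is the same list drawn at once.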

I've been a good developer all year, and would really like this.

Andy Thomas
8

Then again, see this question: "How to visualize the behavior of many concurrent multi-stage processes?"

(I'm shamelessly referring to my own stuff, but that's because the problems it solves were important to me, and the current question is precisely about problems that are important to someone.)

jkff
4

You could have a look at some of the tools that come with Erlang/OTP. They don't have all the features other people suggested, but some of them are quite handy and built with a lot of experience. These include, for instance:

  • Debugger that can debug concurrent processes, also remotely, AFAIR
  • Introspection tools for Mnesia/ETS tables as well as process heaps
  • Message tracing
  • Load monitoring on local and remote nodes
  • Distributed logging and error report system
  • Profiler which works for distributed scenarios
  • Process/task/application manager for distributed systems

These come, of course, in addition to the base features the platform provides, like node discovery, IPC protocol, RPC protocols & services, transparent distribution, distributed built-in database storage, a global and node-local registry for process names, and all the other underlying stuff that makes the platform tick.

1

I think this is a great question, and here's my 0.02 on a tool I would find really useful. One of the challenges I find with distributed programming is deploying code to multiple machines. Quite often these machines have slightly varying configurations or, worse, different application settings.

The tool I have in mind would be one that could on demand reach out to all the machines on which the application is deployed and provide system information. If one specifies a settings file or a resource like a registry, it would provide the list for all the machines. It could also look at the user access privileges for the users running the application.

A refinement would be to provide indications when settings are not matching a master list provided by the developer. It could also indicate servers that have differing configurations and provide diff functionality.
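The comparison step of such a tool could look like the sketch below. It assumes the settings have already been collected from each machine into plain dictionaries (how that collection happens is out of scope here), and all names, hosts and setting keys are made up for illustration.

```python
from typing import Dict, Mapping, Tuple

MISSING = "<missing>"  # placeholder for a key absent on one side

Settings = Mapping[str, str]
Mismatches = Dict[str, Tuple[str, str]]  # key -> (expected, actual)


def diff_against_master(master: Settings,
                        per_host: Mapping[str, Settings]) -> Dict[str, Mismatches]:
    """Report, per host, every setting that differs from the master list."""
    report: Dict[str, Mismatches] = {}
    for host, settings in per_host.items():
        mismatches: Mismatches = {}
        for key in sorted(set(master) | set(settings)):
            expected = master.get(key, MISSING)
            actual = settings.get(key, MISSING)
            if expected != actual:
                mismatches[key] = (expected, actual)
        if mismatches:  # hosts that match the master are left out
            report[host] = mismatches
    return report


master = {"timeout": "30", "log_level": "info"}
collected = {
    "web-01": {"timeout": "30", "log_level": "info"},
    "web-02": {"timeout": "60", "log_level": "info", "debug": "true"},
    "web-03": {"timeout": "30"},
}
report = diff_against_master(master, collected)
```

Hosts that match the master list are omitted, so an empty report means the deployment is consistent.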

This would be really useful for .NET applications, since there are so many configuration sources (machine.config, application.config, IIS settings, user permissions, etc.) that the chances of configurations varying are high.

Nikhil
1

In my opinion, what is missing is a distributed programming platform...a platform that makes application programming over distributed systems as transparent as non-distributed programming is now.

axilmar
  • Or, for distributed-memory parallel programming, Charm++ – Phil Miller Nov 23 '10 at 17:37
  • I am afraid Erlang doesn't cut it. Although Erlang provides the actual mechanisms for distributed programming, it does not provide the infrastructure for easily distributing Erlang applications. For example (taken from the Erlang documentation), in order to have two Erlang processes on different computers communicate, you have to install the same magic cookie on them. I also didn't see anything about encryption, automatic updating of modules, etc. Personally, I envision a system which provides the best possible automation; the only thing a user has to do is download the program's executable... – axilmar Nov 24 '10 at 14:17
  • @axilmar: I wouldn't agree with all your points about Erlang, but I would say it needs a slightly better security infrastructure. I agree about the encryption, but having to install the same magic cookie is actually a lightweight security mechanism: I would prefer a stronger (and more difficult) mechanism, such as ssh, for connecting two Erlang processes on different computers. Automatic updating of modules (hot swapping) is actually good, in my opinion. But you have to strike a balance between security and usability: downloading executables and running them is a recipe for a security disaster. – Muhammad Alkarouri Jan 03 '12 at 17:40
1

Isn't it a bit early to work on tools when we don't even agree on a platform? We have several flavors of actor models, virtual shared memory, UMA, NUMA, synchronous dataflow, tagged token dataflow, multi-hierarchical memory vector processors, clusters, message passing mesh or network-on-a-chip, PGAS, DGAS, etc.

Feel free to add more.

To contribute: I find myself writing a lot of distributed programs by constructing a DAG, which gets transformed into platform-specific code. Every platform optimization is a different set of transformation rules on this DAG. You can see the same happening in Microsoft's Accelerator and Dryad, Intel's Concurrent Collections, MIT's StreamIt, etc. A language-agnostic library that collects all these DAG transformations would save re-inventing the wheel every time.
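To illustrate what one such transformation rule might look like, here is a small sketch: a DAG stored as a dictionary of nodes, plus a single rewrite rule that fuses a chain of "map" stages into one node. The representation, the `map:` naming convention and `fuse_maps` are assumptions made up for this example; they are not taken from any of the systems named above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Node:
    op: str                  # e.g. "source", "map:parse", "sink" (made-up convention)
    inputs: Tuple[str, ...]  # names of upstream nodes


Dag = Dict[str, Node]


def consumers(dag: Dag, name: str) -> List[str]:
    """Names of all nodes that read the output of `name`."""
    return [n for n, node in dag.items() if name in node.inputs]


def fuse_maps(dag: Dag) -> Dag:
    """Rewrite rule: merge map A into map B when B is A's only consumer."""
    dag = dict(dag)  # work on a copy; the input DAG is left untouched
    changed = True
    while changed:
        changed = False
        for name, node in list(dag.items()):
            if not node.op.startswith("map:") or len(node.inputs) != 1:
                continue
            (parent_name,) = node.inputs
            parent = dag[parent_name]
            if (parent.op.startswith("map:")
                    and consumers(dag, parent_name) == [name]):
                fused_op = parent.op + "+" + node.op[len("map:"):]
                dag[name] = Node(fused_op, parent.inputs)
                del dag[parent_name]
                changed = True
                break  # the dict changed, so restart the scan
    return dag


pipeline: Dag = {
    "read": Node("source", ()),
    "parse": Node("map:parse", ("read",)),
    "scale": Node("map:scale", ("parse",)),
    "write": Node("sink", ("scale",)),
}
fused = fuse_maps(pipeline)
```

A library of such rules would let each platform pick the rewrites that pay off there (fusion, splitting, placement) while sharing the DAG representation.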

Beef
1

You can also take a look at Akka: http://akka.io

Jonas Bonér
0

Let me notify those who've favourited this question by pointing to the Greg logger: http://code.google.com/p/greg. It is the distributed logger with a high-precision global time axis that I talked about in my other answer in this thread.

jkff
0

Apart from the already mentioned tool for "visualizing the behavior of many concurrent multi-stage processes" (splot), I've also written "tplot", which is suited to displaying quantitative patterns in logs.

A large presentation about both tools, with lots of pretty pictures, is here.

jkff