
We are developing a distributed system built from components implemented in different programming languages (C++, C# and Python) and communicating with one another across a network. All the components in the system operate on the same business concepts and communicate with one another in terms of these concepts.

As a result, we struggle heavily with the following two challenges:

  1. Keeping the representation of our business concepts in these three languages in sync
  2. Serialization / deserialization of our business concepts across these languages

A naive solution to this problem would be to simply define the same data structures (and the serialization code) three times (once for each of C++, C# and Python).

Unfortunately, this solution has serious drawbacks:

  • It creates a lot of “code duplication”
  • It requires a huge number of cross-language integration tests to keep everything in sync

Another solution we considered is based on frameworks like Protocol Buffers or Thrift. These frameworks have an interface definition language in which the business concepts are defined, and the representations of these concepts in C++, C# and Python (together with the serialization logic) are then auto-generated by the framework.

While this solution doesn’t have the above problems, it has another drawback: the code generated by these frameworks couples the data structures representing the underlying business concepts with the code needed to serialize/deserialize those data structures.

We feel that this pollutes our code base – any code in our system that uses these auto-generated classes is now “familiar” with this serialization/deserialization logic (a serious abstraction leak).

We can work around this by wrapping the auto-generated code in our own classes/interfaces, but that brings us back to the drawbacks of the naive solution.
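
For illustration, a minimal Python sketch of that wrapping workaround (the `person_pb2` module and its `Person` message are hypothetical, standing in for whatever protoc would generate from your schema):

```python
# Sketch: hide protoc-generated code behind a plain domain class.
# Assumes a hypothetical person.proto compiled to person_pb2, e.g.:
#   message Person { string name = 1; int32 age = 2; }
import person_pb2


class Person:
    """Plain domain object; knows nothing about Protocol Buffers."""

    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age


def serialize(person: Person) -> bytes:
    """Adapter layer: the only place aware of the wire format."""
    msg = person_pb2.Person(name=person.name, age=person.age)
    return msg.SerializeToString()


def deserialize(data: bytes) -> Person:
    msg = person_pb2.Person()
    msg.ParseFromString(data)
    return Person(name=msg.name, age=msg.age)
```

Note that the hand-written `Person` class (and its C++ and C# counterparts) must now be kept in sync with the schema by hand, which is exactly the duplication problem again.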

Can anyone recommend a solution that gets around the described problems?

Lev
  • Have you tried CSLA? http://lhotka.net/cslanet/ – Sam Axe Aug 03 '12 at 20:10
  • Have you considered embedding Python in the other two? – Ignacio Vazquez-Abrams Aug 03 '12 at 20:11
  • A system whose business logic and concepts need to stay in sync across 3 different languages adopts an essentially fragile model of operation. Perhaps try putting some effort into restructuring your system around one of the languages now, just to avoid much more work in the future, when you realize on your own that maintaining and developing the system any further would require dropping 2 of the 3 languages. – Desmond Hume Aug 03 '12 at 20:36
  • XML for data exchange, with XSDs defined. Auto code generation of data-access entities based on the XSD, for all languages. Only one point of change – the XSD :) – Ankush Aug 03 '12 at 20:58
  • What do you mean by "business concepts"? Data structures? Business logic? Something else? – Sergei Rogovtcev Aug 03 '12 at 21:11
  • What concerns you about coupling the structures and the serialization code in a protobufs-based solution? What's the practical impact of the "abstraction leak"? It seems like the least bad approach here. – Russell Borogove Aug 03 '12 at 21:31
  • @Boo, can you elaborate on how CSLA helps with the above problem? – Lev Aug 04 '12 at 08:44
  • @DesmondHume: I agree with you in general, but I don't see how to change it now – C++ is used for a real-time component in our system (data acquisition), C# for distributed processing, and Python to implement a test driver for acceptance tests (we use the Robot Framework). – Lev Aug 04 '12 at 08:46
  • @SergRogovtsev: by business concepts I mean data structures, yes. The same concept is represented by three data structures (because we have 3 languages); that's why I differentiate between the concept itself and its representation in each of these languages by a data structure. – Lev Aug 04 '12 at 08:50
  • @RussellBorogove: for instance, in C++ it means that any compilation unit that uses these data structures has to include headers that are only relevant to serialization, which is ugly and increases compilation time. And yes, I agree, it currently seems like the lesser evil for us. – Lev Aug 04 '12 at 08:52

7 Answers


Lev, you may want to look at ICE. It provides an object-oriented IDL with mappings to all the languages you use (C++, Python and .NET – all .NET languages, not just C#, as far as I understand). Although ICE is a middleware framework, you don't have to follow all its policies.

Specifically, in your situation you may want to define the interfaces of your components in ICE IDL and maintain them as part of the code. You can then generate code as part of your build routine and work from there. Or you can use more of the power that ICE gives you.

ICE supports C++ STL data structures and it supports inheritance, so it should give you a sufficiently powerful formalism to build your system gradually over time with a good degree of maintainability.
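
As a rough illustration of how lightweight the Python side can be, here is a minimal sketch (the Slice file `Concepts.ice` and its contents are invented for the example; check the ZeroC documentation for the exact workflow):

```python
# Sketch: load a hypothetical Slice definition with ZeroC Ice for Python.
# Concepts.ice might contain:
#   module Business {
#       struct Person { string name; int age; };
#   };
import Ice

# slice2py can generate this module at build time; loadSlice
# compiles the Slice file into Python classes on the fly instead.
Ice.loadSlice("Concepts.ice")
import Business  # module produced from the Slice definitions

p = Business.Person("Ada", 36)
print(p.name, p.age)
```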

Boris Liberman

Well, once upon a time MS tried to solve this with IDL. Actually, it tried to solve a bit more than defining data structures, but anyway, that's all in the past, because no one in their right mind would go the COM route these days.

One option to look at is SWIG, which is supposed to be able to port data structures as well as actual invocations across languages. I haven't done this myself, but there's a chance it won't couple the serialization and the data structures as tightly as protobufs does.

However, you should really consider whether the aforementioned coupling is such a bad thing after all. What would be the ideal solution for you? Presumably it's something that does two things: it generates compatible data structures across multiple languages based on one definition, and it also provides the serialization code to stitch them together – but in a separate abstraction layer. The idea being that if one day you decide to use a different serialization method, you could just switch out that layer without having to redefine all your data structures.

So consider that – how realistic is it really to expect to someday switch out only the serialization code without touching the interfaces at all? In most cases the serialization format is the most permanent design choice, since you usually have issues with backwards compatibility, etc. So how much are you willing to pay right now in development cost in order to be able to theoretically pull that off in the future?

Now let's assume for a second that such a tool exists, one which separates data structure generation from serialization. And let's say that after two years you decide you need a completely different serialization method. Unless this tool also supports pluggable serialization formats, you would need to develop that layer anyway in order to stitch your existing structures to the new serialization solution – and that's about as much work as just choosing a new package altogether.

So the only really viable solution that would answer your requirements is something that not only supports data type definition and code generation across all your languages, and is not only serialization-agnostic, but also has a ready-made implementation of that future serialization format you would want to switch to – because if it's only agnostic to the serialization format, you'd still have the task of implementing it on your own, in all languages, which isn't really less work than redefining some data structures.
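
To make that concrete, here is a minimal Python sketch of such a separate layer (all names invented): the domain type stays serialization-agnostic, and each format lives behind a small interface. Writing and maintaining one of these per format, per language, is the cost being discussed.

```python
import json
from typing import Protocol


class Person:
    """Serialization-agnostic domain type."""

    def __init__(self, name: str, age: int):
        self.name = name
        self.age = age


class Serializer(Protocol):
    """The pluggable abstraction layer."""

    def dumps(self, person: Person) -> bytes: ...
    def loads(self, data: bytes) -> Person: ...


class JsonSerializer:
    def dumps(self, person: Person) -> bytes:
        return json.dumps({"name": person.name, "age": person.age}).encode()

    def loads(self, data: bytes) -> Person:
        d = json.loads(data)
        return Person(d["name"], d["age"])


# Switching formats later means writing another Serializer (say, one
# backed by protobuf) -- in every language -- which is the point above.
```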

So my point is that there's a reason serialization and data type definition so often go together – it's simply the most common use case. I would take a long look at what exactly you wish to achieve with the abstraction level you require, think of how much work developing such a solution would entail, and decide whether it's worth it. I'm certain there are tools that do this, btw – probably the expensive proprietary kind that cost $10k per license – and the same argument applies there, in my opinion: it's probably just over-engineering.

Assaf Lavie
  • @Assaf, I agree with everything you said about the difficulty of replacing the serialization framework, and with the fact that it's natural for data structures and s11n to be coupled, but using the code generated by ProtoBuf all over the system still smells bad to me, because it means placing the s11n framework at the center of our system. It's a domain-driven-design anti-pattern – very much like picking a database and then designing the whole system around it, or deciding that the whole system should do math on quint16 just because the UI was developed in Qt. – Lev Aug 04 '12 at 15:57
  • Don't decide by smell. Decide by how much each option would cost. Redefining some data structures in a new language isn't that hard... Writing your own serialization implementation and code generator across languages is probably harder. – Assaf Lavie Aug 04 '12 at 16:53

All the components in the system operate on the same business concepts and communicate with one another in terms of these concepts.

If I understand you correctly, you have split your system into different parts communicating via well-defined interfaces. But your interfaces share data structures you call "business concepts" (hard to understand without seeing an example), and since those interfaces have to be built for all three of your languages, you have problems keeping them in sync.

When keeping interfaces in sync becomes a problem, it seems obvious that your interfaces are too broad. There are different possible reasons for that, with different solutions.

Possible reason 1: you overgeneralized your interface concept. If that's the case, redesign: throw the generalization overboard and create interfaces that are only as broad as they have to be.

Possible reason 2: the parts written in different languages are not dealing with separate business cases; you may have a "horizontal" partition between them, but not a vertical one. If that's the case, you cannot avoid the broadness of your interfaces.

Code generation may be the right approach here if reason 2 is your problem. If existing code generators don't suit your needs, why not write your own? Define the interfaces, for example, as classes in C#, introduce some meta attributes, and use reflection in your code generator to extract the information again when generating the corresponding C++, Python and also the "real-to-be-used" C# code. If you need different variants with or without serialization, generate them too. A working generator should not be more effort than a couple of days (YMMV depending on your requirements).
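
The answer proposes reflection over C# classes; purely to show the shape of such a generator, here is a toy sketch in Python instead (definition format and type mappings invented), emitting a C++ struct and a C# class from one definition:

```python
# Toy generator: one definition, two target languages. A real one would
# also emit Python classes and the serialization code for each language.
FIELDS = {"Person": [("name", "string"), ("age", "int")]}

CPP_TYPES = {"string": "std::string", "int": "int32_t"}
CSHARP_TYPES = {"string": "string", "int": "int"}


def gen_cpp(name, fields):
    lines = [f"struct {name} {{"]
    lines += [f"    {CPP_TYPES[t]} {f};" for f, t in fields]
    lines.append("};")
    return "\n".join(lines)


def gen_csharp(name, fields):
    lines = [f"public class {name} {{"]
    lines += [f"    public {CSHARP_TYPES[t]} {f.capitalize()} {{ get; set; }}"
              for f, t in fields]
    lines.append("}")
    return "\n".join(lines)


for name, fields in FIELDS.items():
    print(gen_cpp(name, fields))
    print(gen_csharp(name, fields))
```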

Doc Brown
  • Thanks, Doc. My problem is (2). – Lev Aug 04 '12 at 10:24
  • Can you please elaborate on how you propose to deal with serialization? – Lev Aug 04 '12 at 10:25
  • @Lev: when you write your own code generator, you will have to generate the serialization/deserialization code for all three languages by yourself. If you think that's too much effort, you could use the generator to produce a Thrift definition file on the one hand (and run Thrift afterwards), and the wrapper classes you mentioned in your post on the other hand (instead of building them manually). – Doc Brown Aug 04 '12 at 10:33
  • Doc, this sounds doable. But I have a strange feeling here. It's the 21st century now and almost every serious system is heterogeneous. Yet it sounds like there is no canonical solution to this problem. – Lev Aug 04 '12 at 10:36
  • @Lev: "heterogeneous" often means different things in different environments; when connecting different components written in different languages together, the reason is often that those components are already there, dealing with different business cases, and you need only lean interfaces between them. The interesting question here is why you are going to implement a system with 3 "horizontal" layers in 3 different languages, which results in broad interfaces. – Doc Brown Aug 04 '12 at 14:19
  • One part of our system is developed in C++ because it does real-time acquisition at a very high rate (an embedded system). Another part is C#, since we want to take advantage of a high-level language where we can; writing C++ where you don't really need it has clear disadvantages (C# development is much faster and the frameworks available around it are much broader). Python is used mainly in the testability layer. Strictly speaking it's not part of our system; it's a test driver for Robot Framework (an acceptance-testing framework). – Lev Aug 04 '12 at 16:00
  • @Lev: that sounds to me like you could replace your Python part completely with C#, and keep the C++ part so small that you don't need broad interfaces. But honestly, I don't know your system, and I guess you have good reasons why the architecture is the way it is. So if you think the solution presented here is the way you will go, don't forget to mark it as "the" answer. – Doc Brown Aug 04 '12 at 20:35

I agree with Tristan Reid (wrapping the business logic). Actually, some months ago I faced the same problem, and then I incidentally discovered the book "The Art of Unix Programming" (freely available online). What grabbed my attention was the philosophy of separating policy from mechanism (i.e. interfaces from engines). Modern programming environments such as the .NET platform try to integrate everything under a single domain. In those days I was asked to develop a web application that had to satisfy the following requirements:

  1. It had to be easily adapted to future trends of User Interfaces without having to change the core algorithms.

  2. It had to be accessible by means of different interfaces: web, command line and desktop GUI.

  3. It had to run on Windows and Linux.

I opted to develop the mechanism (engines) completely in C/C++, using native OS libraries (POSIX or WinAPI) and good open-source libraries (PostgreSQL, XML, etc.). I developed the engine modules as command-line programs and eventually implemented two interfaces: web (with PHP and a jQuery framework) and desktop (.NET framework). Both interfaces had nothing to do with the mechanisms: they simply launched the core module executables by calling functions such as CreateProcess() on Windows or fork() on UNIX, and used pipes to monitor their processes.
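
As a rough sketch of that split in Python (the engine path and its command-line contract are invented): the interface drives the native engine purely through its standard streams, knowing nothing about its internals.

```python
import subprocess

# The engine is a separate native executable; the interface only knows
# its command-line contract, not its implementation.
proc = subprocess.Popen(
    ["./engine", "--compute"],  # hypothetical core module
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
out, _ = proc.communicate(input="some request\n")
print("engine replied:", out)
```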

I'm not saying the Unix programming philosophy is good for all purposes, but I have been applying it since then with good results, and maybe it will work for you too. Choose one language for implementing the mechanism and then another that makes interface design easy.

Claudi
  • I agree with the philosophy you described, but how does it help me with my question? – Lev Aug 04 '12 at 08:54
  • Well, I have assumed (maybe I'm wrong) that you are using 3 languages for implementation reasons. I mean, you may use C++ for fast execution/portability, C# for interface issues, and Python for scripting. The question is: do all parts need to know everything about all data structures? Maybe the engines need to know X, Y and Z of a structure called "Person", but the interfaces may only need to know about X. Of course you'll always need some sort of redundancy, but the goal is to minimize it. I'm sorry I can't be more specific – could you give us further details? – Claudi Aug 04 '12 at 09:26

You can wrap your business logic as a web service and call it from all three languages - just a single implementation.
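
For instance, a minimal sketch of the consuming side in Python (the endpoint URL and JSON shape are hypothetical; the C++ and C# clients would do the equivalent with their HTTP libraries):

```python
import json
import urllib.request

# Hypothetical endpoint exposing the single business-logic implementation.
req = urllib.request.Request(
    "http://localhost:8080/person/42",
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    person = json.loads(resp.read())
print(person["name"])
```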

Tristan Reid
  • I'm less worried about the business logic (algorithms, code) – more about the business concepts (data structures). I want to keep the data structures that are defined in different languages but represent the same concept in sync. How does a web service help me with that? – Lev Aug 03 '12 at 21:14
  • "Web service" implies the use of HTML protocol, while the OP might have no reasons to deal with HTML. "Network service" would be more appropriate here. – Desmond Hume Aug 03 '12 at 21:14
  • @Lev With each of these 3 languages you can generate "stub code" automatically from a service definition, which contains the data structures and interfaces. This stub generation can be automated as part of your build process. You won't need to rewrite any language-specific code unless you make a breaking change in the data-structure interface. This will also handle your serialization problem, as the data-structure definition you create in your service definition will tell each stub to create a serializable object representing the data in the definition. – Tristan Reid Aug 07 '12 at 00:19

You could model these data structures using a UML modeling tool (Enterprise Architect comes to mind, as it can generate code for all three) and then generate code for each language directly from the model.

Though I would look closely at a previous comment about using XSD.

Tim Hoffman
  • Tim, we've used Enterprise Architect for a while, and I have to say that it slows down development heavily – using this tool for day-to-day programming seems like a theoretical solution only. – Lev Aug 04 '12 at 08:57
  • Interesting. I have a web-based project running on App Engine where a significant portion of the code is modelled and generated from Enterprise Architect. All of the data model, URL mapping, views, and stubs for class methods are generated. There are about 40 main entities in the datastore, so no new entities, views or routes are added to the system without modelling them first. Having said that, it is a small team working on the project. Actual code in public and private methods is not modelled; code generation preserves all custom code. (This is all Python.) – Tim Hoffman Aug 04 '12 at 09:37

I would accomplish this by using some kind of meta-information about your domain entities (either XML or a DSL, depending on complexity) and then generating code for each language. That would reduce (manual) code duplication.
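
For instance (a minimal sketch; the XML schema and field types are invented), the meta-information could be a small XML file that a generator script turns into class definitions for each language:

```python
import xml.etree.ElementTree as ET

# Hypothetical entity definition, normally kept in its own file:
XML = """
<entity name="Person">
    <field name="name" type="str"/>
    <field name="age" type="int"/>
</entity>
"""

root = ET.fromstring(XML)
fields = [(f.get("name"), f.get("type")) for f in root.findall("field")]

# Emit a Python dataclass; analogous templates would emit C++ and C#.
lines = ["from dataclasses import dataclass", "", "@dataclass",
         f"class {root.get('name')}:"]
lines += [f"    {n}: {t}" for n, t in fields]
print("\n".join(lines))
```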

Sergei Rogovtcev