
I can't be sure where the error lies, but I'm attempting to pass messages between a Python client and a Chapel server. The client code is

import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

for request in range(10):
    print("Sending request %s ..." % request)
    socket.send(str("Yo"))
    message = socket.recv()
    print("OMG!! He said %s" % message)

And the Chapel server is

use ZMQ;
var context: Context;
var socket = context.socket(ZMQ.REP);
socket.bind("tcp://*:5555");

while ( 1 < 2) {
  var msg = socket.recv(string);
  socket.recv(string);
  writeln("got something");
  socket.send("back from chapel");
}

The error message seems to be a common one, but I don't fully understand it.

server.chpl:7: error: halt reached - Error in Socket.recv(string): Operation cannot be accomplished in current state

I think I am alternating sends and receives correctly on both sides. The original Chapel example on the Chapel site worked fine, but I'm having trouble modifying it.

Update

With the help of the Chapel team on this thread, this now works.

client.py

import zmq

context = zmq.Context()
socket = context.socket(zmq.REQ)
socket.connect("tcp://localhost:5555")

for request in range(10):
    message = "Hello %i from Python" % request
    print("[Python] Sending request: %s" % message)
    socket.send_string(message)
    message = socket.recv_string()
    print("[Python] Received response: %s" % message)

server.chpl

use ZMQ;

var context: Context;
var socket = context.socket(ZMQ.REP);
socket.bind("tcp://*:5555");

for i in 0..#10 {
  var msg = socket.recv(string);
  writeln("[Chapel] Received message: ", msg);
  socket.send("Hello %i from Chapel".format(i));
}
Brian Dolan
  • As proposed below, my suspect is the **double, adjacent calls to the `.recv()` method** on a `REP`-instance on the Chapel side, which necessarily collides with the dFSA's hard-wired logic and throws the distributed system into an un-salvageable mutual-deadlock state. For details, feel free to read the post below. – user3666197 Aug 12 '17 at 21:21

2 Answers


@user3666197's answer gives a good discussion of the ZeroMQ state machine, but I think the problem actually lies in how the Chapel ZMQ module serializes and transmits strings.

The Socket.send(string) and Socket.recv(string) methods in Chapel serialize a string by sending two messages. This was intended to match the pattern in the ZeroMQ Guide's "Minor Note on Strings"; however, as implemented, this serialization scheme is incorrect and incompatible with certain ZeroMQ socket patterns.

To send a string, Chapel sends one multi-part message using two calls to zmq_send(): the first carries the string size and is sent with the ZMQ_SNDMORE flag, the second carries the byte buffer; receiving works similarly. That means that your one call to socket.recv(string) was actually making two back-to-back calls to zmq_recv() under the hood. With the REQ/REP pattern, those two back-to-back zmq_recv() calls put the ZeroMQ state machine into an invalid state, hence the error message.

This is definitely a bug with Chapel's ZMQ module.

For reference, I'm the author of the (definitely-not-bug-free) Chapel ZMQ module.
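For anyone who needs to interoperate with the current behaviour before a fix lands, the sketch below shows what a Python peer would have to do to mirror the two-frame encoding described above. The frame layout (a length frame flagged ZMQ_SNDMORE, then a byte-buffer frame) follows this answer; the integer width and endianness of the length frame are assumptions that should be checked against the module's source, and the helper names are purely illustrative.

import struct
import zmq

def send_chapel_style_string(sock, text):
    # Frame 1: the string length (assumed here: little-endian 64-bit int), flagged SNDMORE.
    # Frame 2: the raw UTF-8 bytes.
    data = text.encode("utf-8")
    sock.send(struct.pack("<q", len(data)), zmq.SNDMORE)
    sock.send(data)

def recv_chapel_style_string(sock):
    # recv_multipart() gathers both frames of the multi-part message at once,
    # so this side performs a single logical receive per string.
    frames = sock.recv_multipart()
    (length,) = struct.unpack("<q", frames[0])
    return frames[1][:length].decode("utf-8")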

Nick
  • Would you be able to provide a working example? The one in the documentation takes only one message and quits. Thanks! – Brian Dolan Aug 14 '17 at 21:27
  • Nick, are you sure to state that "**to send a string**" Chapel.ZMQ module sends *(cit.)* "**one message with the string size followed by another message with the byte buffer**" .OR. sends **one-multipart-message** ( internally structured as noted: 1st-part:(int) + 2nd-part:(byte[]) ) -- this is a cardinal difference for the user-side program design. Could you kindly clarify your statement above? Thanks. – user3666197 Aug 14 '17 at 22:01
  • The issue gets even worse, as **`PUB/SUB` Scalable Formal Communication Pattern expects left-side exact-match(es) for topic-filters to work as documented in ZeroMQ API**. So any "hidden-message-masquerades" ( as noted both here, in discussion, and in the source-code remarks about hidden string manipulations, decided as a hard-wired message-re-structuring **actually principally de-rail this `PUB/SUB` core-feature**. Are you aware of this or have you enforced a full-scale cross-compatibility testing among ZeroMQ API ( v2.11, v3.x, v4.x ) & non-Chapel `{PUB|SUB}` to confirm this as being solved? – user3666197 Aug 14 '17 at 22:08
  • The following GitHub issue (opened by @Nick as a result of this discussion) relates to this answer: https://github.com/chapel-lang/chapel/issues/7008 – Brad Aug 14 '17 at 23:40
  • @user3666197 I added links to the Chapel implementation for `Socket.send/recv(string)` in an edit to my answer. To answer you directly, the current implementation sends a string as a multi-part message with two calls to `zmq_send()`, the first of which adds the `ZMQ_SNDMORE` flag. – Nick Aug 15 '17 at 00:55
  • @user3666197 No, there has not been "full-scale cross-compatibility testing" with the Chapel ZMQ module. The Chapel documentation is clear that the ZMQ module only targets the v4.x API. I appreciate that you've pointed out an implementation issue with `PUB`/`SUB` filtering and the module's approach to serialization. I'll note that in the GitHub issue. – Nick Aug 15 '17 at 00:58
  • @BrianDolan A working example of Python and Chapel communicating over ZeroMQ? With which socket pattern (e.g., `REQ`/`REP`)? With what type of data (e.g., strings)? – Nick Aug 15 '17 at 01:03
  • @Nick discussion grew pretty far from [Chapel]. Anyway, the deadly-sin is in the serialisation. Do not rather cite a ZeroMQ wire-level remarks on transporting strings on the internal-level-of-detail and define the Chapel-language-bindings not to violate the ZeroMQ:RFC specifications. Never attempt to use "easy"-shortcuts - as to send-twice across API, instead of proper Chapel-language-binding algorithmisation of the API-level behaviour. Better and robust implementation makes sense. Any opposite not. (+ Be warned, that v4.x API creeps on the ZeroMQ side, so not sure if safe to design against. ) – user3666197 Aug 15 '17 at 01:12
  • @Nick For reference, my intent is to use ZeroMQ to pass JSON between Python and Chapel. More specifically, I want to pass sparse arrays and matrices, but I figured I'd use JSON to describe them on either side. – Brian Dolan Aug 15 '17 at 11:29

Until this is resolved and re-confirmed by the team, kindly test any ZMQ module services with just int payloads, and possibly avoid the PUB/SUB archetypes ( due to the pending string-matching issue ).

As @Nick has recently disclosed here, there is still a way to go before the ZMQ module services meet ZeroMQ API compliance and fully open a cross-compatible gate to heterogeneous distributed-systems:

To send a string, Chapel sends one message with the string size followed by another message with the byte buffer; receiving works similarly.

That means that your one call to <aSocket>.recv( string ) was actually making two back-to-back calls to zmq_recv() under the hood. With the REQ/REP pattern, those two back-to-back zmq_recv() calls put the ZeroMQ state machine into an invalid state, hence the error message.

This is definitely a bug with Chapel's ZMQ module.


A few steps to shed more light on the scene:

Let me propose a few measures to take before diagnosing the root cause. ZeroMQ is quite a powerful framework, within which one could hardly pick a harder ( and more fragile ) messaging archetype than REQ/REP.

The internal Finite-State-Automata ( in fact, distributed-FSA ) are blocking by design, to enforce a pendulum-like message passing among the connected peers ( which need not be just the first 2 ), so that a SEQ of .send()-.recv()-.send()-.recv()-... on side [A] matches the SEQ of .recv()-.send()-.recv()-... on side [B]. This dFSA also has a principally un-salvageable mutual deadlock: if, for any reason, both sides enter a wait-state in which both [A] and [B] expect to receive the next message from the opposite side of the channel, neither can ever proceed.
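A minimal pyzmq sketch of that lockstep rule, assuming nothing beyond a bare REQ socket: a second consecutive .send() ( or a .recv() issued out of turn ) is refused by the dFSA with the very same error message seen on the Chapel side.

import zmq

ctx = zmq.Context()
req = ctx.socket(zmq.REQ)
req.connect("tcp://localhost:5555")   # no peer needs to be listening for this demo

req.send(b"first")                    # legal: the REQ side may now only .recv()
try:
    req.send(b"second")               # illegal: a second .send() before any .recv()
except zmq.ZMQError as e:
    print("dFSA refused it:", e)      # "Operation cannot be accomplished in current state"
finally:
    req.close(linger=0)
    ctx.term()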

This said, my advice would be to first move to the simplest possible test - using a pair of unrestricted, simplex channels ( be it [A]PUSH/[B]PULL + [B]PUSH/[A]PULL, or a bit more complicated scheme with PUB/SUB ).

Not going into a setup for a fully meshed, multi-Agent infrastructure, but a simplified version of it ( without any need or intention to use ROUTER/DEALER channels, but perhaps duplicating ( reversed ) PUSH/PULL-s if extending the mock-up scheme ) - see the sketch after the diagram below:

( diagram: the simplified mock-up infrastructure proposed above )
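A minimal sketch of one of the two proposed simplex channels ( [A]PUSH -> [B]PULL ), with both ends kept in a single process only so the example stays self-contained; the port number is illustrative. The reverse direction would simply be a second, independent PUSH/PULL pair.

import zmq

ctx = zmq.Context()

pull_B = ctx.socket(zmq.PULL)             # side [B]: only ever receives on this channel
pull_B.bind("tcp://*:5556")

push_A = ctx.socket(zmq.PUSH)             # side [A]: only ever sends on this channel
push_A.connect("tcp://localhost:5556")

for i in range(3):
    push_A.send_string("payload %d" % i)  # no dFSA lockstep: a send never waits for a reply

for i in range(3):
    print(pull_B.recv_string())           # messages arrive in order, no alternation required

push_A.close(linger=0)
pull_B.close(linger=0)
ctx.term()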

More effort will yet have to be spent on the implied limitations arising from the current implementation constraints:

In Chapel, sending or receiving messages on a Socket uses multipart messages and the Reflection module to serialize primitive and user-defined data types whenever possible. Currently, the ZMQ module serializes primitive numeric types, strings, and records composed of these types. Strings are encoded as a length (as int) followed by the character array (in bytes).

This creates some more issues on both sides, and some tweaking ought to be expected if these remarks are not just wire-level internals but extend to the top-level ZeroMQ messaging/signalling-layer ( ref. details for managing subscriptions, where ZeroMQ topic-filter matching is based on a left-side exact-match against the message received, et al ).
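A small pyzmq sketch of that left-side exact ( prefix ) match, run entirely in one process with an illustrative port; it shows why any hidden bytes prepended in front of the topic text would make the subscription filter silently drop the message.

import time
import zmq

ctx = zmq.Context()

pub = ctx.socket(zmq.PUB)
pub.bind("tcp://*:5557")

sub = ctx.socket(zmq.SUB)
sub.connect("tcp://localhost:5557")
sub.setsockopt(zmq.SUBSCRIBE, b"[chapel2python.HB]")   # the filter is a required message prefix

time.sleep(0.5)                                        # allow the subscription to propagate

pub.send(b"[chapel2python.HB] alive")                  # prefix matches -> delivered
pub.send(b"\x13\x00[chapel2python.HB]")                # hidden bytes in front -> silently dropped

print(sub.recv())                                      # only the matching message ever arrives

sub.close(linger=0)
pub.close(linger=0)
ctx.term()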


The Python side enjoys a much wider freedom of design:

#
# python
# #########

import time
import zmq; context = zmq.Context()

print( "INF: This Agent uses ZeroMQ v.{0:}".format( zmq.__version__ ) )

dataAB = context.socket( zmq.REQ )
dataAB.setsockopt( zmq.LINGER, 0 )        # ( a must in pre v4.0+ )
dataAB.connect( "tcp://localhost:5555" )

heartB = context.socket( zmq.SUB )
heartB.setsockopt( zmq.LINGER,   0 )      # ( a must in pre v4.0+ )
heartB.setsockopt( zmq.CONFLATE, 1 )      # ( ignore history, keep just the last message )

heartB.connect( "tcp://localhost:6666" )
heartB.setsockopt( zmq.SUBSCRIBE, "[chapel2python.HB]" )
heartB.setsockopt( zmq.SUBSCRIBE, "" )    # in case [Chapel] complicates serialisation
# -------------------------------------------------------------------    
while ( True ):
      pass;             print( "INF: waiting for a [Chapel] HeartBeat-Message" )
      try:
          hbIN = heartB.recv( zmq.NOBLOCK )   # raises zmq.Again if nothing is queued yet
          pass;         print( "ACK: [Chapel] Heart-Beat-Message .recv()-ed" )
          break
      except zmq.Again:
          time.sleep( 0.5 )
# -------------------------------------------------------------------
for request in range(10):
    pass;               print( "INF: Sending a request %s to [Chapel] ..." % request )
    dataAB.send( str( "Yo" ) )
    pass;               print( "INF: a blocking .recv(), [Chapel] is to answer ..." )
    message = dataAB.recv()
    pass;               print( "INF: [Chapel] said %s" % message )
# -------------------------------------------------------------------
dataAB.close()
heartB.close()
context.term()
# -------------------------------------------------------------------

Some further try:/except:/finally: constructs ought to be put into service to handle KeyboardInterrupt-s from the infinite while()-loops et al, but for clarity these were omitted here.
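One minimal shape such a guard could take, sketched around a plain receive loop ( the port and the socket type are illustrative ):

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PULL)
sock.bind("tcp://*:5558")

try:
    while True:
        print(sock.recv_string())          # blocking receive loop
except KeyboardInterrupt:
    print("INF: interrupted by user, shutting down gracefully")
finally:
    sock.close(linger=0)                   # LINGER 0 so close() never hangs
    ctx.term()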


On the Chapel side we will do our best to keep pace with the API, as-is:

The documentation, as-is, does not yet help to decide whether user-code has an option to control if a call to the .send() / .recv() method is implicitly always blocking or not, while your code assumes it is being run in blocking-mode ( which I always and principally strongly discourage for any distributed-system design; blocking is a poor practice - more on this here ).
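For reference, a controlled, non-blocking receive on the Python side can be sketched with a Poller and a timeout ( the port is illustrative ); the Chapel-side behaviour is what the documentation excerpt below addresses:

import zmq

ctx = zmq.Context()
rep = ctx.socket(zmq.REP)
rep.bind("tcp://*:5559")

poller = zmq.Poller()
poller.register(rep, zmq.POLLIN)

events = dict(poller.poll(timeout=100))    # wait at most 100 ms for a request
if rep in events:
    request = rep.recv()
    rep.send(b"ack")                       # REQ/REP still requires the paired reply
else:
    pass                                   # nothing arrived; the task is free to do other work

rep.close(linger=0)
ctx.term()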

While the C-level call zmq_send() may be a blocking call (depending on the socket type and flag arguments), it is desirable that a semantically-blocking call to Socket.send() allow other Chapel tasks to be scheduled on the OS thread as supported by the tasking layer. Internally, the ZMQ module uses non-blocking calls to zmq_send() and zmq_recv() to transfer data, and yields to the tasking layer via chpl_task_yield() when the call would otherwise block.

Source

use ZMQ;
use Reflection;

var context: Context;
var dataBA = context.socket( ZMQ.REP ),
    heartB = context.socket( ZMQ.PUB );
var WAITms = 0;                             // setup as explicit int
    dataBA.setsockopt( ZMQ.LINGER, WAITms );// a must
    heartB.setsockopt( ZMQ.LINGER, WAITms );// a preventive step

    dataBA.bind( "tcp://*:5555" );          // may reverse .bind()/.connect()

    writeln( "INF: This Agent uses ZeroMQ v.", ZMQ.version() );

// /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
config var   MAX_LOOPS  = 120;              //  --MAX_LOOPS = 10 set on cmdline
       var            i =   0;

while ( i < MAX_LOOPS ) {
 // --------------------------------------- // .send HeartBeat
    heartB.send( "[chapel2python.HB]" );
    i += 1;
    writeln( "INF: Sent HeartBeat # ", i );
 // --------------------------------------- // .send HeartBeat

    var msg = dataBA.recv( string );        // .recv() from python
 // - - - - - - - - - - - - - - - - - - - - // - - - - -WILL-[BLOCK]!!!
                                            //          ( ref. src )
    writeln( "INF: [Chapel] got: ",
              getField( msg, 1 )
              );

    dataBA.send( "back from chapel" );      // .send() to   python
}
writeln( "INF: MAX_LOOPS were exhausted,",
             " will exit-{} & .close()",
             " channels' sockets before",
             " [Chapel] exits to system."
             );
// /\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
dataBA.close( WAITms );                     // explicit graceful termination
heartB.close( WAITms );                     // explicit graceful termination
context.deinit();                           // explicit context  termination
                                            //       as not yet sure
                                            //       on auto-termination
                                            //       warranties
user3666197
  • I expect this will be the full answer, but I'd like to digest it and try it for a few hours before I accept it. Thanks. – Brian Dolan Aug 12 '17 at 21:10
  • Sure, kindly mind a note about serialisation. Might need to go as deep as to use just `int`-message payloads to start the mock-up to work, before moving into more complex data-types. Enjoy the worlds of ZeroMQ + Chapel combined! ( Also **would be great, if you could also find a second and reply to the large sparse-arrays, with a post-update with details on `repr( I )` as you promised** there ). – user3666197 Aug 12 '17 at 21:16
  • HA! Sorry, I broke that app trying to fix it and have moved off. If I get it running again I WILL post it. It's not a secret or anything :) – Brian Dolan Aug 12 '17 at 23:50
  • Glad it helped. – user3666197 Aug 13 '17 at 00:05
  • This line apparently won't compile for Chapel 1.15 (I'm extremely new to 0MQ) `dataBA.setsockopt( ZMQ_LINGER, 0 ); ` – Brian Dolan Aug 13 '17 at 01:45
  • Error is `server.chpl:7: error: 'ZMQ_LINGER' undeclared (first use this function)` Replacing it with `ZMQ.LINGER` compiles, but gives runtime error `error: halt reached - Error in Socket.setsockopt(): Invalid argument` – Brian Dolan Aug 13 '17 at 02:03
  • Mea culpa -- obviously `ZMQ.LINGER` >>>http://chapel.cray.com/docs/1.15/modules/packages/ZMQ.html#ZMQ.LINGER however the failed `Socket.setsockopt( option: int, val: ?T )` will need further investigation in the source ( as the type-check was correct & the syntax was met ). **ATM, the Chapel code can live without this option set -- i.e. comment it out** -- as it can have this option set in the graceful-termination section ( inside a call to `Socket.close( linger: int = unset );` ), so the `Context.term()` ( non-explicit in Chapel ) will have free hands to terminate and release resources back to the system. Hope it helps. – user3666197 Aug 13 '17 at 04:52
  • For reference, I tried this and got three errors. line 43 `server.chpl:43: error: direct calls to deinit() are not allowed`, line 13 `server.chpl:13: error: illegal tuple indexing expression`. If I comment them out, it compiles and I get the runtime error `server.chpl:8: error: halt reached - Error in Socket.setsockopt(): Invalid argument`. – Brian Dolan Aug 14 '17 at 12:20
  • For the line 13 error, the `version` query doesn't take parentheses—removing them should fix that problem without commenting it out. For the line 43 error, this call should not be necessary (and, as the compiler complains, can't be made manually). I've asked the ZMQ module author to catch up with this thread for the other issues. – Brad Aug 14 '17 at 18:23
  • @Brad thanks for stepping in, the explicit Context-instance termination is of an underestimated importance and was a thing that could hang-up the whole infrastructure in early ZeroMQ versions 2.11+. It is a fair move to allow application programmers keep these preventive-measures under explicit code controls, or to provide an explicit and absolute warranty from the module engineering, that any such hungup(s) will never happen under any possible circumstances. Who would like to ever have to resort to just reboot the whole HPC-infrastructure polygon frozen this way, wouldn't he? – user3666197 Aug 14 '17 at 18:59
  • Hi @user3666197: This could be useful feedback to file as a GitHub feature request issue -- request for an explicit terminate() method in ZMQ (and/or "ability to call deinit() routines explicitly more generally). We discussed supporting the latter a bit but decided against it, so the approach that an object designer would typically take in Chapel to get this pattern would be to support an explicit destroy()/terminate() method on their object, a bool saying whether or not it had been invoked, and then have the deinit() method invoke it in the event that nobody else had. – Brad Aug 14 '17 at 19:49
  • The core issue is not the user's wish for a nice-to-have feature, but preventive-termination steps that are not deferred to whatever exception-handling procedures may be present, creating certainty that all previously allocated resources were indeed released back to the system before code execution halts. (In this sense a bool-flag masquerade need not represent a fail-safe, non-deferrable graceful termination --- Ms. Margaret Hamilton, the Apollo Lunar Module Control System software designer, taught us The Lesson on robustness & collision-avoidance with thresholded responses & terminations.) – user3666197 Aug 14 '17 at 20:10
  • @Brad For the line 13: The [Chapel] **`ZMQ`** module documentation defines **`proc version: ( int, int, int )`** coherently with the ZeroMQ API. Did I miss something when using the call for such procedure as `ZMQ.version()`, while you advised above to remove parenthesis? If the [Chapel] **`proc`** is actually just lexically faked inside a `zmq.h` file, as a something like a `#define version ( , , )`, why it was not declared, coherently with other use-cases, as a [Chapel] **`const`** instead? – user3666197 Aug 16 '17 at 06:52
  • Hi @user3666197: This is subtle, but note the `:` after `version` and before the `(int, int, int)`. This colon says "I'm about to tell you what the return type of this procedure is" and then the parenthesized ints that follow say "I return a 3-tuple of integers". So `version` is what we call a parentheses-less function in Chapel — one that takes no arguments and acts like a field yet one implemented using code rather than memory. Arguably, this may have been less confusing if it had been written `proc version: 3*int` to avoid the confusion between formal argument list and tuple return type. – Brad Aug 16 '17 at 16:47
  • [Chapel] on -IDE got recently upgraded, so can confirm via TEST [CHAPEL]ZMQ.E that now yields "[Chapel] uses Version: major: 4 minor: 1 patch: 6." – user3666197 Aug 16 '17 at 18:58
  • A side-note on parentheses-less function call-interface: What is such an immense benefit of this very specific syntax-constructor, to have it implemented into the language? In other words, what is the 2nd, explicit reason **not to support** a reasonable expectation of using a call to a parameter-less procedure as **`aTuple = ZMQ.version();`** once the "naked" ( parentheses-less ) syntax bears an inherent contradiction, once a call to a documented **`proc`** suddenly starts to appear indistinguishible from a reference to a **`const`** or a **`var`**? Did I miss some pyramidal benefit in this? – user3666197 Aug 16 '17 at 19:07