1

Summary

I have an initiator with quickfix 50sp2 on a FIXT1.1 transport layer. As I see in the log messages, there is no connection problem. Heartbeats are reaching to each side. But randomly my initiator sends test request to acceptor even if it gets heartbeat as expected. Acceptor response to test request with a heartbeat including TestReqID. After that my initiator sends logon messages. Acceptor sends logon too and then loop starts. If I restart the whole application, initiator can logon and gets heartbeat.

Expectation

After acceptor sends heartbeat with specific TestReqID, I'm expecting that getting and sending heartbeats. Instead, initiator is sending logon message and not loggin' to the acceptor.

Event.Log

20220719-01:01:05: Initiated logon request
20220719-01:01:05: Setting DefaultApplVerID (1137=9) from Logon
20220719-01:01:05: Logon contains ResetSeqNumFlag=Y, resetting sequence numbers to 1
20220719-01:01:05: Received logon
20220719-09:00:17: Sent test request TEST
20220719-09:00:29: Disconnecting: Timed out waiting for heartbeat
20220719-09:00:33: Pending connection not established after 4001 ms.
20220719-09:00:36: Pending connection not established after 7002 ms.
20220719-09:00:36: MINA session created: local=/hosti:porti, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/hosta:porta
20220719-09:00:37: Initiated logon request
20220719-09:00:37: Setting DefaultApplVerID (1137=9) from Logon
20220719-09:00:48: Disconnecting: Timed out waiting for logon response
20220719-09:01:29: MINA session created: local=/hosti:porti, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/hosta:porta
20220719-09:01:30: Initiated logon request
20220719-09:01:30: Setting DefaultApplVerID (1137=9) from Logon
20220719-09:01:40: Disconnecting: Timed out waiting for logon response
20220719-09:02:29: MINA session created: local=/hosti:porti, class org.apache.mina.transport.socket.nio.NioSocketSession, remote=/hosta:porta

Messages.log

8=FIXT.1.19=00005235=049=ACCEPTOR56=INITIATOR34=98952=20220719-08:58:46.53810=079
8=FIXT.1.19=6135=034=95749=INITIATOR50=SNDRSUBID52=20220719-08:59:05.15856=ACCEPTOR10=107
8=FIXT.1.19=00041535=849=ACCEPTOR56=INITIATOR34=99052=20220719-08:59:16.22737=ORDID11=NONE453=3448=GL-P447=D452=28448=INITIATOR447=D452=1448=KL305447=D452=12880=TRDMTCHID17=1234567150=F39=221100=1.000000021101=1.000000055=EURUSD48=17107122=M54=138=1.000000040=244=1.000000059=0528=P32=1.000000031=1.0000000151=014=1.00000006=060=20220719-08:59:16.226797=Y10=090
8=FIXT.1.19=00052135=AE49=ACCEPTOR56=INITIATOR34=99152=20220719-08:59:16.23220008=1.000000021100=1.000000021102=123420017=020018=1.0000000730=1.0000000571=90063491003=123456:123A487=0856=0829=1001855=1880=TRDMTCHID570=N55=EURUSD48=17107122=M38=1.000000032=1.000000031=1.000000075=20220719715=2022071960=20220719-08:59:16.000573=0552=154=137=ORDID453=3448=INITIATOR447=D452=1448=GL-P447=D452=28448=KL305447=D452=12528=P1057=N797=Y1703=11704=1000000010=054
8=FIXT.1.19=6135=034=95849=INITIATOR50=SNDRSUBID52=20220719-08:59:35.15856=ACCEPTOR10=111
8=FIXT.1.19=00005235=049=ACCEPTOR56=INITIATOR34=99252=20220719-08:59:46.59010=072
8=FIXT.1.19=6135=034=95949=INITIATOR50=SNDRSUBID52=20220719-09:00:05.15856=ACCEPTOR10=096
8=FIXT.1.19=00005235=049=ACCEPTOR56=INITIATOR34=99352=20220719-09:00:16.65910=063
8=FIXT.1.19=7035=134=96049=INITIATOR50=SNDRSUBID52=20220719-09:00:17.15856=ACCEPTOR112=TEST10=110
8=FIXT.1.19=00006135=049=ACCEPTOR56=INITIATOR34=99452=20220719-09:00:17.163112=TEST10=073
8=FIXT.1.19=12435=A34=149=INITIATOR50=SNDRSUBID52=20220719-09:00:37.15956=ACCEPTOR98=0108=30141=Y553=USERNAME554=PASSWORD1137=91409=010=083
8=FIXT.1.19=00009135=A49=ACCEPTOR56=INITIATOR34=152=20220719-09:00:37.17820002=2698=0108=30141=Y1409=01137=910=061
8=FIXT.1.19=12435=A34=149=INITIATOR50=SNDRSUBID52=20220719-09:01:30.15856=ACCEPTOR98=0108=30141=Y553=USERNAME554=PASSWORD1137=91409=010=076
8=FIXT.1.19=00009135=A49=ACCEPTOR56=INITIATOR34=152=20220719-09:01:30.17520002=2698=0108=30141=Y1409=01137=910=052
8=FIXT.1.19=12435=A34=149=INITIATOR50=SNDRSUBID52=20220719-09:02:30.15856=ACCEPTOR98=0108=30141=Y553=USERNAME554=PASSWORD1137=91409=010=077
8=FIXT.1.19=00009135=A49=ACCEPTOR56=INITIATOR34=152=20220719-09:02:30.17520002=2698=0108=30141=Y1409=01137=910=053

What I've Tried

Added TestRequestDelayMultiplier = 1 to the session properties.

Found this topic it looks relative but there is no answer.

session.properties

[DEFAULT]
ConnectionType=initiator
ReconnectInterval=60
ResetOnLogon=Y
FileLogPath=logs/Client_Logs
SenderCompID=INITIATOR
SenderSubID=SNDRSUBID
ValidateIncomingMessage=N
TestRequestDelayMultiplier=1

[SESSION]
BeginString=FIXT.1.1
TargetCompID=ACCEPTOR
StartDay=sunday
EndDay=friday
StartTime=21:35:00
EndTime=21:30:00
HeartBtInt=30
CheckLatency=N
SocketConnectPort=port
SocketConnectHost=host
DefaultApplVerID=9
UseDataDictionary=Y
DataDictionary=config/cptyFIX50sp2.xml
FileStorePath=logs/Client_Seq_Store
TransportDataDictionary=config/FIXT11.xml
AppDataDictionary=config/cptyFIX50sp2.xml
Christoph John
  • 3,003
  • 2
  • 13
  • 23
estherial
  • 39
  • 5
  • Where are initiator and acceptor located? I assume in different networks or even separated by internet, maybe via VPN? When there is a `Disconnecting: Timed out waiting for heartbeat` and the connection can't be established afterwards (as shown by `Pending connection not established after...` in your log) then my first guess is a problem in the network. To the other side the connection may still seem vital and it keeps sending messages but they don't reach your side because of the aforementioned network problem. – Christoph John Jul 20 '22 at 15:26
  • @ChristophJohn Yes, they're on different networks. Messages may be going through VPN regardless of us, for security. These event and message logs are ours. So I can see test request's response in my logs from acceptor. That's why I didn't think a network issue. Because messages are reaching to us? If you need any detail more than these I can share with you. – estherial Jul 21 '22 at 08:09

1 Answers1

0

I think the problem is that the counterparty does not reply to your TestRequest message with the full SessionID. Normally this is a combination of SenderCompID and TargetCompID.

However, in your settings you specify SenderSubID=SNDRSUBID but the counterparty does not include the SNDRSUBID on their messages. In my opinion they should send it back to you as TargetSubID.

Could you try to remove the SenderSubID from your settings and see if that works? If you need the SenderSubID on some of your messages then please only set it programatically on selected messages and not for the complete session.

Christoph John
  • 3,003
  • 2
  • 13
  • 23
  • I tried like that. Counterparty rejected that logon message with required tag missing. They're waiting for RefTagID = 50. – estherial Jul 21 '22 at 10:24
  • Then they should correctly respond with `TargetSubID=SNDRSUBID` on all of their messages. Alternatively, you could remove SenderSubID=SNDRSUBID from your config and only set the `SenderSubID` on your `Logon` message in the `toAdmin()` callback. Or another alternative: configure `LogonTag=50=SNDRSUBID` in your settings (if you have a recent QFJ version and need the tag on the `Logon` message **only**). – Christoph John Jul 21 '22 at 10:39
  • I just set tag 50 programmaticly for logon message. It logged in but when my application send a heartbeat they rejected with required tag missing 50. – estherial Jul 21 '22 at 12:11
  • Then also set it on all admin messages via `toAdmin()`. – Christoph John Jul 21 '22 at 12:19
  • Isn't it the same thing as putting the SenderSubID on the session.properties? – estherial Jul 21 '22 at 12:24
  • No because when you put it into your session properties, then QFJ considers it as part of the SessionID. IMHO that is why QFJ does not consider the TestRequest reply correctly. – Christoph John Jul 21 '22 at 12:30
  • Ok, I'll try that. This can take a while. Thank you so much. – estherial Jul 21 '22 at 12:32
  • Chris, I added tag 50 all admin messages. Today we couldn't get a heartbeat from acceptor for 30 seconds(heartbeat interval is 30) then our application sent a test request. Acceptor didn't respond to this request. After that our application sent a logon request. We are successfully logged in. No loop. Problem looks solved but no response to test request which makes me think about it. I'll monitor this for a while. Thanks for your help. – estherial Jul 22 '22 at 08:42
  • When you are logged in now and heartbeats are exchanged then I think the first problem of today was a different one. Which counterparty are you connecting to BTW? – Christoph John Jul 22 '22 at 08:54
  • Hi Chris, same thing happened today. Last heartbeat came from acceptor at 20220725-08:50:15.398. Our initiator sent a TestRequest at 20220725-08:50:22.900. Acceptor sent a heartbeat with exact TestReqID. After that logon loop happens. It's odd which our initiator sends a test request even it gets heartbeat. I can't share counterparty because of privacy. – estherial Jul 25 '22 at 09:18
  • I assume you only are aware of the problem after it happens? Otherwise next time please create a series of thread dumps starting from the time where you get the last heartbeat before the test request. Otherwise it is just guesswork. In summary I think it is a problem which seems to be specific to your environment. If heartbeats/test request wouldn't work in general then there would be more users reporting this. Of course there is a low probability that it could be a bug that is only triggered under specific circumstances. – Christoph John Jul 25 '22 at 11:03
  • Also, what are you doing with the messages inside your fromApp/fromAdmin method? Something seems fishy, especially in conjunction with the connection that cannot be established after the problem occurred. – Christoph John Jul 25 '22 at 14:21
  • Yes, I notice after it happens. fromApp method just has crack(message, sessionId); fromAdmin is empty. – estherial Jul 25 '22 at 15:45
  • Could it be that it always or sometimes happens after you received app messages? At least in your initial post there is an app message shortly before. – Christoph John Jul 25 '22 at 15:56
  • Some extra info, The acceptor sends too many execution messages and trade capture reports in a very short time. This problem occurs in there. I've seen a few TestRequests(there were just heartbeats not business messages), and they've logged in without any problems. – estherial Jul 25 '22 at 15:57
  • Ok as i assumed in comment above. So either problem in your app processing or maybe garbage collection going on? Thread dumps should help to pinpoint the issue. – Christoph John Jul 25 '22 at 15:58
  • Could you advice a way or method for thread dump in this situation. I just want to give you the best output. – estherial Jul 25 '22 at 16:01
  • https://stackoverflow.com/questions/36700586/how-to-capture-thread-dump-programatically-using-java-code/36701651#36701651 You should do that when you sent a TestRequest. Please upload the info somewhere since it will be too much to insert here. – Christoph John Jul 25 '22 at 17:28
  • And BTW what are you doing in the crack() method? Lengthy processing in there maybe? – Christoph John Jul 25 '22 at 19:11
  • crack method is from quickfix core, MessageCracker.class So It's default. – estherial Jul 26 '22 at 09:45
  • Ok but where do you process your app messages? – Christoph John Jul 26 '22 at 13:04
  • In onMessage methods. – estherial Jul 26 '22 at 21:01
  • Please try to remove all your message processing logic and check if the problem persists. If yes, please provide the thread dumps as described earlier. – Christoph John Jul 27 '22 at 06:20
  • We removed transaction logics. Monitored the application for 3 days. Twice a day, We are sending TestRequest because we didn't get heartbeats. Acceptor didn't respond our TestRequests. After that we logged in successfully. We never saw the main problem which is logon loop. It looks like our transaction methods has a problem. But may i ask you, if we get an exception in onMessage methods(with exception handling) does it affect connection between acceptor and initiator? – estherial Jul 28 '22 at 08:31
  • onMessage() (which usually is called from within the fromApp() callback) could affect the connection if you do some lengthy processing in that method. Since onMessage() will be executed on the same thread that handles incoming messages it will affect the connection in the same way you were seeing: QFJ sends a TestRequest and the reply (as Heartbeat msg) is stuck in the queue of messages to be processed but seems to hang there for too long which eventually triggers a disconnection due to Heartbeat timeout. So either speed up the processing of your app messages or move that to a separate thread. – Christoph John Jul 28 '22 at 09:12