I have spent days hunting down a connection problem without any luck. I'm trying to implement a relatively simple one2one Call with Kurento.
Below you will find a debug log of Kurento of a case where the connection could be established and a case where the connection failed.
If you need any more logs (e.g. of the clients, the signalling server, tcpdumps, or trace logs of Kurento), just let me know and I will provide them!
Any help or new input is greatly appreciated!
Description of the Problem:
In about 30% of cases, the WebRTC connection cannot be established. Unfortunately I can't find any pattern in when the connection can be established and when it can't; it seems completely random. I'm in the same network, using the same devices, the same TURN server, and the same signalling protocol, but in 30% of cases the connection cannot be established.
When I run the application locally, it works much more reliably: the connection can be established almost 100% of the time (or maybe even 100% of the time, I have tested so many times I lost track). I set up the infrastructure locally with Docker and run the different containers (TURN, Kurento, signalling) in separate networks to mimic a production deployment.
We experience the same behavior in our development and production environment. In our development environment we have absolutely no firewalls in place, so that doesn't seem to be the problem.
What I have tried to find the cause of the Problem:
Mostly I have been comparing logs of cases that worked and cases that didn't work but I have failed to find any significant difference between them that could point me to the problem.
I have tested the WebRTC connection over the TURN server (with Firefox and the force_relay flag) and over Kurento directly, but in both cases the connection fails in ~30% of cases.
I have tried filtering all ICE candidates that are not Relay candidates.
I have sniffed the traffic between our signalling server (which also controls Kurento) and Kurento to look for any difference in the JSON-RPC messages exchanged, but they appear to be essentially the same.
I have tested our STUN and TURN server using this tool: https://webrtc.github.io/samples/src/content/peerconnection/trickle-ice/ and I get both server-reflexive and relay candidates that look correct.
I have sniffed the traffic from the clients for a successful and an unsuccessful connection but could not spot a significant difference.
I have simplified the Kurento media pipeline (no recording, no Hubs) but the behavior is the same
I have used different browsers (Chrome, Firefox and a native iOS implementation) but the behavior is the same
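For reference, the relay-only filtering mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from our application: the function names `candidateType` and `keepRelayOnly` are made up for this post, and it assumes each candidate is available as an object with a `candidate` string in the usual SDP form (`candidate:... typ relay ...`).

```typescript
// Candidate types defined by ICE: host, srflx, prflx, relay.
type IceCandidateType = "host" | "srflx" | "prflx" | "relay";

// Hypothetical helper: extract the type from an SDP candidate line such as
// "candidate:1 1 udp 2122260223 192.0.2.10 54321 typ host".
// Returns null if the line has no recognizable "typ <type>" pair.
function candidateType(candidateLine: string): IceCandidateType | null {
  const tokens = candidateLine.trim().split(/\s+/);
  const typIndex = tokens.indexOf("typ");
  if (typIndex === -1 || typIndex + 1 >= tokens.length) {
    return null;
  }
  const value = tokens[typIndex + 1];
  if (value === "host" || value === "srflx" || value === "prflx" || value === "relay") {
    return value;
  }
  return null;
}

// Hypothetical helper: keep only relay (TURN) candidates, dropping
// host, server-reflexive, and peer-reflexive ones.
function keepRelayOnly<T extends { candidate: string }>(candidates: T[]): T[] {
  return candidates.filter((c) => candidateType(c.candidate) === "relay");
}
```

In the browser, a filter like this would be applied to each candidate before it is sent to the signalling server, so that only candidates allocated on the TURN server reach the other side.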
Kurento debug logs of a case where the connection could be established:
https://gist.github.com/omnibrain/2bc7ad54f626d278d3c8bac29767ac4c
Kurento debug logs of a case where the connection could NOT be established:
https://gist.github.com/omnibrain/f7caee04a5c6d77ea22a9ccfa95dd825