
When I set up a data channel between 2 browsers (testing on 2 different machines on the same network), I get different results regarding lag in the following 2 cases.

Case 1: sending / receiving only

When I set up one side to send test messages at an interval of, for example, 70ms, I see them coming in on the other side without noticeable lag. The time between each received message is close to 70ms. So far so good.
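For reference, the "time between each received message" in this case can be computed from receive timestamps with something like the sketch below. The function name `interArrivalGaps` and the way the timestamps are collected are assumptions for illustration, not the code used in the tests.

```typescript
// Returns the gaps (in ms) between consecutive receive timestamps.
// Hypothetical helper: the timestamps would come from the receiving side,
// e.g. channel.onmessage = () => stamps.push(performance.now());
function interArrivalGaps(receivedAtMs: number[]): number[] {
  const gaps: number[] = [];
  let previous: number | undefined;
  for (const t of receivedAtMs) {
    if (previous !== undefined) {
      gaps.push(t - previous);
    }
    previous = t;
  }
  return gaps;
}
```

With a 70ms sender, most gaps should then sit close to 70.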

Case 2: Both sides sending and receiving in turn

When I set up both sides so that each sends a message as soon as it has received a message from the other side AND more than 70ms have passed since it last sent, everything goes fine most of the time. Every few seconds (not consistently), however, I measure a delay of ~1000ms. The weird thing is that the time between the vast majority of messages is either < 200ms OR > ~1000ms.
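The send rule for this case can be sketched as a small timing function. This is only an illustration of the rule as described; `delayBeforeReply` and its parameter names are hypothetical, and the actual test code is not shown in this question.

```typescript
// How long to wait (in ms) before replying to a message that just arrived,
// given that at least minIntervalMs must separate two sends.
function delayBeforeReply(
  nowMs: number,
  lastSentAtMs: number,
  minIntervalMs: number = 70
): number {
  const elapsed = nowMs - lastSentAtMs;
  // Reply immediately if the interval has already passed, otherwise wait
  // for the remainder of it.
  return Math.max(0, minIntervalMs - elapsed);
}
```

In a browser this would be called from the data channel's `onmessage` handler, with the reply scheduled via `setTimeout` using the returned delay.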


I tested both cases in (combinations of) Chrome and Firefox, and the behavior was similar. I also tested it on a mobile phone network (using tethering), which showed the same lag, although less often. The data channel was not configured with any special options, so it uses a reliable, ordered connection.

What could be causing this? It seems to me that it can be fixed, since sending in one direction (either way) works fine without lag. I also tried using separate data channels for sending and receiving, which made no difference.


Examples

Here is an example of test results for the second case. It's a list of all the round trip times that were higher than 200ms for 1000 round trips.

(Delay index) round trip time - round trip number - time
(0) 2192 - 0 - "2016-05-06T12:34:18.193Z"
(1) 1059 - 111 - "2016-05-06T12:34:22.777Z"
(2) 1165 - 372 - "2016-05-06T12:34:32.485Z"
(3) 1062 - 434 - "2016-05-06T12:34:35.585Z"
(4) 1157 - 463 - "2016-05-06T12:34:37.598Z"
(5) 1059 - 605 - "2016-05-06T12:34:43.264Z"
(6) 1160 - 612 - "2016-05-06T12:34:44.633Z"
(7) 1093 - 617 - "2016-05-06T12:34:45.857Z"
(8) 1158 - 624 - "2016-05-06T12:34:47.204Z"
(9) 1162 - 688 - "2016-05-06T12:34:50.401Z"
(10) 1158 - 733 - "2016-05-06T12:34:52.962Z"
(11) 1161 - 798 - "2016-05-06T12:34:56.163Z"
(12) 1157 - 822 - "2016-05-06T12:34:58.077Z"
(13) 1158 - 888 - "2016-05-06T12:35:01.281Z"
(14) 1160 - 893 - "2016-05-06T12:35:02.563Z"
(15) 1085 - 898 - "2016-05-06T12:35:03.768Z" 
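A list like the one above can be produced from the raw measurements with a simple filter. The sketch below is an assumption about how that might look; `slowRoundTrips` and the `SlowTrip` shape are hypothetical names, and the sample values in the usage comment are illustrative only.

```typescript
// One delayed round trip: its position in the run and its duration.
interface SlowTrip {
  roundTrip: number; // round trip number (0-based)
  rttMs: number;     // measured round trip time in ms
}

// Keeps only the round trips that took longer than thresholdMs.
function slowRoundTrips(rttsMs: number[], thresholdMs: number = 200): SlowTrip[] {
  const slow: SlowTrip[] = [];
  rttsMs.forEach((rttMs, roundTrip) => {
    if (rttMs > thresholdMs) {
      slow.push({ roundTrip, rttMs });
    }
  });
  return slow;
}

// Usage (illustrative values):
// slowRoundTrips([2192, 80, 75, 1059]) keeps round trips 0 and 3.
```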

Here is another example, including a 'PacketsSentPerSecond' graph from chrome://webrtc-internals:

PacketsSentPerSecond graph

In this test, ~2100 packets were sent, resulting in the following 26 round trips that took more than 900ms: [1762.6050000000014, 1179.7200000000012, 1765.375, 1149.945000000007, 1180.1399999999994, 1180.9550000000017, 1246.2450000000026, 1750.2649999999994, 1388.0149999999994, 1100.7499999999854, 4130.475000000006, 1160.1150000000052, 1082.4399999999878, 1055.2300000000105, 1498.715000000011, 1105.8850000000093, 1478.1600000000035, 2948.649999999994, 1538.2549999999756, 1839.9099999999744, 1768.6449999999895, 1167.929999999993, 1139.1750000000175, 1173.8850000000093, 1245.6600000000035, 1075.375]

I still haven't figured out what is causing this lag. I would expect a much smoother graph.

user125661
  • Maybe a bug in your code. – jib May 05 '16 at 23:49
  • I was triggered by this post: http://stackoverflow.com/questions/19475894/settimeout-setinterval-1000ms-lag-in-background-tabs-chrome-and-firefox. Although my problem isn't about setTimeout in background tabs, I have a feeling it might be caused by something similar. – user125661 May 06 '16 at 12:20
  • Thanks for the link! Are you keeping both tabs/browsers focused when you experience this? If you unfocus the tab or the browser, then yes, I would expect this. W3C is still looking for drivers to spec out [datachannel in workers](https://github.com/w3c/webrtc-pc/issues/230). – jib May 06 '16 at 16:38
  • Yes, I keep them both focused, so this is not what I am experiencing. In the example included in my question, you see that out of 1000 packets, only 16 are delayed more than 200ms. Any other ideas? – user125661 May 06 '16 at 17:18
  • Smells like garbage collection – Alex Cohn May 06 '16 at 17:29
  • @AlexCohn, would garbage collection take that long? – user125661 May 06 '16 at 17:38
  • Also, since the lag does not occur when trying the script in 2 tabs in the same browser, I doubt if that's what's happening. Although there might be some extra memory overhead when actually sending between different devices. – user125661 May 06 '16 at 17:45
  • Even though you seem to have put a lot of effort into your question, it's not a programming question and therefore off-topic. You didn't even include any code! – Kevin May 10 '16 at 08:09
  • Just a silly thought, do you use a device with a battery? If so, can you test again with the power plugged in? I had weird delays in my voip app about a year ago and it turned out my customers used some tablet without the power cord, which lowered the network card's priority and caused lag. – Kevin May 10 '16 at 08:43
  • Thanks for your suggestion. Both devices were plugged in, so this isn't the cause either. But I have found a solution, which I will post as an answer. – user125661 May 10 '16 at 16:49

3 Answers


Although I'm still unsure what is causing the problem, I have found a solution. My best guess is that the problem is caused by flow control when one of the peers is not sending data for a while (or its packets just don't reach the other peer).

I noticed there are no problems when both peers send packets to each other at a 70ms interval without waiting for a packet from each other. As soon as I delay sending a packet while waiting for an incoming packet, I get the >1000ms lags.

So what I do now is send packets at a steady rate EVEN if they are empty. My application requires sending data in turn, but I just check at an interval whether there is anything to send, and if not, I still send an empty packet. This way, the problem seems solved in the tests I have done so far!
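A minimal sketch of this approach is shown below. It is not my exact code: `SteadySender`, `ChannelLike`, and the `"{}"` placeholder used for the "empty" packet are all hypothetical names chosen for illustration, and what an empty packet looks like on the wire is up to the application.

```typescript
// Minimal stand-in for the part of RTCDataChannel that is used here.
interface ChannelLike {
  send(data: string): void;
}

class SteadySender {
  private queue: string[] = [];
  private channel: ChannelLike;
  private emptyPayload: string;

  // emptyPayload is an assumed marker for "nothing to send".
  constructor(channel: ChannelLike, emptyPayload: string = "{}") {
    this.channel = channel;
    this.emptyPayload = emptyPayload;
  }

  // Application data waits here until the next tick.
  enqueue(message: string): void {
    this.queue.push(message);
  }

  // Called once per interval: sends real data if there is any,
  // otherwise the empty payload, so the channel never goes quiet.
  tick(): void {
    const next = this.queue.shift();
    this.channel.send(next !== undefined ? next : this.emptyPayload);
  }
}

// Browser usage (channel is an open RTCDataChannel):
// const sender = new SteadySender(channel);
// setInterval(() => sender.tick(), 70);
```

The receiving side then has to recognize and ignore the empty payload.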

user125661
  • I wonder, before the fix, how did the dialogue ever recover, if a packet got lost? Maybe, a dropped packet was retransmitted ~1sec later? – Alex Cohn May 11 '16 at 05:13
  • As mentioned I used a reliable, ordered data channel, which means that dropped packets will be retransmitted. So sooner or later the packet would arrive :). The same applies to the situation after the fix. My application doesn't resend packets itself, it just sends a more constant flow of packets (some being empty) – user125661 May 13 '16 at 06:20
  • So my hypothesis is quite probable. If you don't control retransmission, you cannot trust timing of packages that arrive. – Alex Cohn May 13 '16 at 14:15
  • Sure, but the thing is that I didn't change anything about that. I don't retransmit any packets, as I think you assume. I just send extra empty packets at a steady interval. So I am still dependent on the same filled packet arriving soon, but somehow (I guess due to some flow control), it arrives sooner now. – user125661 May 14 '16 at 06:20
  • I understood that there was no retransmission on the application level. Probably, in your scenario you should prefer a data channel that is not reliable, if you can cope with some data getting lost. Otherwise, you need protection against packet bursts (after retransmission succeeds, the delayed packets may arrive without the expected delay). – Alex Cohn May 14 '16 at 10:38
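For the unreliable channel suggested in the last comment, the standard `createDataChannel` options `ordered` and `maxRetransmits` are the relevant knobs. The sketch below only builds the options object; the helper name `unreliableChannelOptions`, the locally declared `DataChannelOptions` type, and the `"updates"` label are assumptions for illustration, and whether this removes the ~1000ms lag was not tested in this thread.

```typescript
// Subset of the standard RTCDataChannelInit dictionary, redeclared here
// so the sketch does not depend on browser type definitions.
interface DataChannelOptions {
  ordered?: boolean;
  maxRetransmits?: number;
}

// Options for a channel that never retransmits and does not enforce order,
// so a lost message cannot hold up the ones behind it.
function unreliableChannelOptions(): DataChannelOptions {
  return { ordered: false, maxRetransmits: 0 };
}

// Browser usage (pc is an existing RTCPeerConnection):
// const channel = pc.createDataChannel("updates", unreliableChannelOptions());
```

With these options the application has to tolerate lost and reordered messages itself.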

Perhaps it has something to do with the 1000ms lag people are discussing? (like this setTimeout/setInterval 1000ms lag in background tabs (Chrome and Firefox))

You configured your sending interval to 70ms, which is a relatively small interval. Have you tried using a larger interval? You might also want to do some testing with the native WebRTC solution for iOS or Android, so that you can tell whether the issue comes from the core WebRTC implementation (seems unlikely to me) or from some browser limitation.

Stephenye
  • Thanks for your answer. I confirmed that it has nothing to do with the setTimeout/setInterval. I didn't know about that native code, thanks for the suggestion! – user125661 May 13 '16 at 06:23

I am almost sure that this is caused by your TURN server. I did a very similar test in the last month and all packets were received within a few milliseconds via TURN (using our own TURN server). The test was done with both Firefox and Chrome.

Istvan