8

I have an issue with MongoDB becoming unresponsive and refusing connections in production.

In testing, I'm able to consistently reproduce the "lock out" by sending lot of requests to my app.

Here is an example output from mongostat:

     0      0      0      0       0       1       0   1.3g   1.8g   177m      0        0          0       0|0     0|0    62b     1k    18   19:40:11
     0      0      0      0       0       1       0   1.3g   1.8g   177m      0        0          0       0|0     0|0    62b     1k    18   19:40:12
--------------LOAD TEST BEGINS-----------------------   
     1     56      0      0       1      58       0   1.3g  1.81g   177m      0        0          0       0|0     0|0    10k   302k    18   19:40:13
    10    116      0      0       2     127       0   1.3g  1.81g   177m      0        0          0       0|0     1|0    32k   263k    19   19:40:14
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
   206    431      0      0       2     638       0   1.3g  1.82g   177m      0      1.1          0       0|0     0|0   206k   339k    22   19:40:15
   331    649      0      0       2     983       0   1.3g  1.81g   177m      0        1          0       0|0     0|0   317k   521k    21   19:40:16
   367    723      0      0      10    1090       0   1.3g  1.81g   177m      0      0.9          0       0|1     1|1   297k     1m    20   19:40:17
   500    961      0      0      11    1461       0   1.3g  1.81g   177m      0      1.2          0       0|0     0|0   390k     1m    19   19:40:18
   543   1073      0      0      10    1618       0   1.3g  1.82g   177m      0      1.2          0       0|0     2|0   433k     1m    21   19:40:19
   461    909      0      0      11    1370       0   1.3g  1.81g   177m      0      1.1          0       0|0     2|0   365k     1m    20   19:40:20
   347    685      0      0       6    1033       0   1.3g  1.81g   177m      0      0.8          0       0|0     1|0   411k   962k    21   19:40:21
   482    938      0      0       9    1421       0   1.3g  1.81g   177m      0      1.1          0       0|0     0|0   492k     1m    20   19:40:22
   444    877      0      0      10    1322       0   1.3g  1.81g   177m      0      1.1          0       0|0     0|1   479k     1m    22   19:40:23
   528   1019      0      0       9    1549       0   1.3g  1.81g   177m      0      1.2          0       0|0     0|0   446k     1m    21   19:40:24
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
   520   1033      0      0      10    1554       0   1.3g  1.81g   177m      0      1.2          0       0|0     1|0   418k     1m    22   19:40:25
   448    878      0      0      10    1326       0   1.3g  1.81g   177m      0      1.2          0       0|0     1|0   349k     1m    22   19:40:26
   503    979      0      0       8    1483       0   1.3g  1.82g   177m      0      1.3          0       0|0     1|0   397k     1m    21   19:40:27
   379    743      0      0      10    1123       0   1.3g  1.81g   177m      0        1          0       0|0     0|0   315k     1m    22   19:40:28
   456    896      0      0       7    1354       0   1.3g  1.81g   177m      0      1.1          0       0|0     0|0   436k     1m    22   19:40:29
   371    715      0      0      10    1086       0   1.3g  1.81g   175m      0      0.9          0       0|0     0|0   299k     1m    20   19:40:30
   505    993      0      0       7    1499       0   1.3g  1.81g   175m      0      1.2          0       0|0     2|0   394k     1m    21   19:40:31
   493   1023      0      0      10    1519       0   1.3g  1.82g   175m      0      1.1          0       0|0     1|0   392k     1m    22   19:40:32
   426    868      0      0      11    1295       0   1.3g  1.81g   170m      0        1          0       0|0     1|0   363k     1m    21   19:40:33
   352    772      0      0       7    1125       0   1.3g  1.81g   163m     19      4.4          0       0|1     0|1   368k     1m    22   19:40:34
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
   330    656      0      0       5     985       0   1.3g  1.81g   159m      4      1.5          0       0|0     0|0   325k   876k    19   19:40:35
   398    817      0      0       8    1218       0   1.3g  1.81g   155m      9      1.3          0       0|0     1|0   372k     1m    21   19:40:36
   464   1015      0      0       8    1479       0   1.3g  1.81g   155m      2      1.2          0       0|0     1|0   381k     1m    21   19:40:37
   457    915      0      0      11    1372       0   1.3g  1.81g   155m      0      1.1          0       0|0     0|0   356k     1m    22   19:40:38
   458    960      0      0       7    1419       0   1.3g  1.83g   155m      0      1.1          0       0|0     0|0   376k     1m    22   19:40:39
   391    760      0      0      11    1153       0   1.3g  1.81g   155m      0        1          0       0|0     1|0   321k     1m    21   19:40:40
   315    639      0      0       5     956       0   1.3g  1.81g   155m      0      0.8          0       0|0     0|0   325k   877k    21   19:40:41
   303    696      0      0       7     998       0   1.3g  1.81g   155m      0      0.8          0       0|0     0|0   279k     1m    21   19:40:42
   388    772      0      0       8    1163       0   1.3g  1.81g   155m      0        1          0       0|0     0|0   308k     1m    20   19:40:43
   273    623      0      0       7     895       0   1.3g  1.81g   155m      0      0.7          0       0|0     2|0   232k     1m    20   19:40:44
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
   397    796      0      0       7    1194       0   1.3g  1.81g   155m      0      0.9          0       0|0     0|0   311k     1m    19   19:40:45
   347    754      0      0       4    1103       0   1.3g  1.81g   155m      0      0.9          0       0|0     0|0   329k   771k    20   19:40:46
   336    700      0      0       8    1042       0   1.3g  1.82g   160m     20      0.8          0       2|2     3|2   370k     1m    23   19:40:47
   272    525      0      0       5     805       0   1.3g  1.79g   167m    188      0.6          0       0|0     0|0   306k   818k    18   19:40:48
   103    246      0      0       1     350       0   1.3g  1.78g   166m      0      0.4          0       0|0     0|0   136k   234k    15   19:40:49
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:50
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:51
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:52
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:53
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:54
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:55
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:56
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:57
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:58
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:40:59
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:00
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:01
     0      3      0      0       0       4       0   1.3g  1.78g   166m      0        0          0       0|0     0|0   775b     4k    15   19:41:02
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:03
     0      0      0      0       0       1       1   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:04
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:05
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:06
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:07
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:08
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:09
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:10
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:11
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:12
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:13
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:14
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:15
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:16
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:17
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:18
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:19
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:20
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:21
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:22
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:23
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:24
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:25
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:26
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:27
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:28
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:29
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:30
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:31
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:32
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:33
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:34
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:35
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:36
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:37
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:38
     0      0      0      0       0       1       0   1.3g  1.78g   166m      0        0          0       0|0     0|0    62b     1k    15   19:41:39
     0     28      0      0       0      29       0   1.3g  1.78g   166m      0        0          0       0|0     0|0     5k    27k    15   19:41:40
    86    234      0      0       1     322       0   1.3g  1.79g   166m      0      0.2          0       0|0     1|0   145k   217k    17   19:41:41
   216    445      0      0       5     663       0   1.3g  1.79g   167m      0      0.5          0       0|0     0|0   220k   807k    16   19:41:42
   302    612      0      0       6     916       0   1.3g  1.79g   167m      0      0.6          0       0|0     0|0   236k   950k    16   19:41:43
   113    247      0      0       1     362       0   1.3g  1.79g   167m      0      0.2          0       0|0     0|0    95k   192k    17   19:41:44
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time
    38     63      0      0       2     103       0   1.3g  1.79g   167m      0        0          0       0|0     0|0    26k   289k    16   19:41:45
--------------LOAD TEST ENDS----------------------- 
     0      0      0      0       0       2       0   1.3g  1.79g   167m      0        0          0       0|0     0|0   124b     2k    16   19:41:46
     0      0      0      0       0       2       0   1.3g  1.79g   167m      0        0          0       0|0     0|0   124b     2k    16   19:41:47
     0      0      0      0       0       2       0   1.3g  1.79g   167m      0        0          0       0|0     0|0   124b     2k    16   19:41:48
     0      0      0      0       0       2       0   1.3g  1.79g   167m      0        0          0       0|0     0|0   124b     2k    16   19:41:49

As you can see, in the middle of the load test, the reads and writes drop to pretty much nothing. During that time, I'm unable to connect locally to MongoDB via mongo or mongostat (connection refused). As you can also see, the connections is pretty low and there are 16000 connections in the max connections for mongo.

I'm able to reproduce this consistently. However, each time the "lock out" is a different length. I've observed 51 seconds, 20 seconds, 28 seconds. In each instance, I'm still connected to mongo via mongostat (and receiving output), but new connections will be refused.

user@myhostname:~# mongostat
couldn't connect to [127.0.0.1] couldn't connect to server 127.0.0.1

What could be the issue? I'm running MongoDB v1.8.2.

As an aside, I'm able to reproduce a similar issue, where the commands don't drop off entirely, but are low in volume and fluctuate wildly.

Homer6
  • 15,034
  • 11
  • 61
  • 81
  • I'm curious about this, but as the close suggests, probably better suited @ Server Fault – tymeJV Sep 06 '13 at 18:30
  • @tymeJV I'm the lead developer of this app and I'm also looking for a solution that may be in the MongoDB source code. TBH, I'll take a resolution for any source. :-) – Homer6 Sep 06 '13 at 18:32
  • Mongo used to have some pretty coarsely grained locks - have you tried a newer version? – Frederick Cheung Sep 06 '13 at 19:37
  • @FrederickCheung I'm in the process of trying it now. Will post back with the results. I'm assuming that if locks were the problem, it might be indicated with the lock% metric, but you may be right about a different lock that isn't reported on (perhaps even a temporary deadlock). – Homer6 Sep 06 '13 at 19:39
  • @FrederickCheung Upgrading to 2.4.6 has no effect. The problem still exists. – Homer6 Sep 06 '13 at 20:09
  • If anything, its actually easier to trigger now. It'll lock out several times in each load test, whereas it would only do it once or twice per run before. – Homer6 Sep 06 '13 at 20:17
  • I must admit I have never seen this issue, what is your OS? I have a feeling the answer lies in that 1 command constantly running – Sammaye Sep 06 '13 at 20:48
  • @Sammaye It's running Ubuntu 10.04.2 LTS 64 bit running on Xen(linode) - Linux 2.6.39.1 x86_64 - 1GB RAM - 8 Cores of Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz - 512MB Swap (and there are no syslog errors) – Homer6 Sep 06 '13 at 20:57
  • Hmm you have 1.3GB RAM mapped and apparently 1.78 as virtul...that doesn't add up for one – Sammaye Sep 06 '13 at 20:58
  • Why should that add up? The difference alone could be stack size. – Homer6 Sep 06 '13 at 21:00
  • Hmmm how so? I am unsure about that – Sammaye Sep 06 '13 at 21:02
  • Well... it may not be the case for mongo, but I've seen major variations in virtual memory, especially based off of the number of threads. I'm just not sure why it not adding up would be an indication of an issue. What am I missing? – Homer6 Sep 06 '13 at 21:03
  • If MongoDB is artifically pushing for RAM beyond the amount that exists that will bring the DB to a grinding halt, it could also cause problems for mmap, but you are getting so few faults, but also it not adding up makes me concerned about why it is not – Sammaye Sep 06 '13 at 21:05
  • @Sammaye I'm upgrading the box to a much stronger box and will post the results (8GB-based linode server). We'll see if that makes any difference. – Homer6 Sep 06 '13 at 21:14
  • I also remember hearing something about linode, I will search JIRA – Sammaye Sep 06 '13 at 21:15
  • @Sammaye Unfortunately, there is little, if any, effect by upgrading the box to 8GB – Homer6 Sep 06 '13 at 22:13
  • Hmmm, I'll keep thinking – Sammaye Sep 06 '13 at 22:17
  • @Sammaye Fixed it. Thx for looking into it with me. – Homer6 Sep 07 '13 at 00:24

2 Answers2

0

Fixed it:

sudo sysctl net.ipv4.tcp_tw_reuse=1

Then, restart mongo.

Alternatively, you can add it to /etc/sysctl.conf (so it will be run on reboot):

net.ipv4.tcp_tw_reuse=1

Then run this to reload (without having to reboot)

sudo sysctl -p /etc/sysctl.conf

This "fix" will disable the timewait state for TCP sockets (server-wide). So, it's really not a fix at all. However, until mongo reduces their timewait state using SO_LINGER, a large number of server sockets will pool up in the TIME_WAIT state and remain unusable for new connections. You can see the count of connections in TIME_WAIT with this:

netstat -an | grep TIME_WAIT | wc -l

With that, I was able to see it fail at around 28k TIME_WAIT connections. Using this kernel flag:

sysctl net.ipv4.ip_local_port_range="18000 65535"

The server fails at 45k connections. So, to reproduce the error more easily, you could lower the range to 200 or something small.

So, the result of this was a programming question after all (as you can see from the last link):

TCP option SO_LINGER (zero) - when it's required

http://alas.matf.bg.ac.rs/manuals/lspe/snode=105.html

Community
  • 1
  • 1
Homer6
  • 15,034
  • 11
  • 61
  • 81
  • I have created a bug report in Mongo's JIRA that relates to this issue: https://jira.mongodb.org/browse/SERVER-10710 – Homer6 Sep 08 '13 at 18:33
  • Also, another note, when I upgraded from 1.8.2 to 2.4.6, the pattern changed. Under 2.4.6, the "lock out" would occur several times more per load test (and for shorter periods). – Homer6 Sep 08 '13 at 18:35
0

Spent a ton of time debugging a similar issue. In the end, it turned out we had too many short lived connection, with each one getting closed after one small operation. However, this was leaving a bunch of CLOSE_WAIT connections on the mongo server process and the exact same symptom you described in the question. The solution was to reuse connections instead of opening and closing a connection every command.

Sohan
  • 3,757
  • 2
  • 24
  • 24