(Cross-posted to the Lwt GitHub issues.)

I have boiled down my usage to this code sample which will leak file descriptors.

Say you have:

#require "lwt.unix"

open Lwt.Infix

let echo ic oc = Lwt_io.(write_chars oc (read_chars ic))

let program =
  let server_address = Unix.(ADDR_INET (inet_addr_loopback, 2000)) in

  let other_addr = Unix.(ADDR_INET (inet_addr_loopback, 2001)) in

  let server = Lwt_io.establish_server server_address begin fun (tcp_ic, tcp_oc) ->
      Lwt_io.with_connection other_addr begin fun (nc_ic, nc_oc) ->

        Lwt_io.printl "Created connection" >>= fun () ->
        echo tcp_ic nc_oc <&> echo nc_ic tcp_oc >>= fun () ->
        Lwt_io.printl "finished"

      end
      |> Lwt.ignore_result

    end
  in
  fst (Lwt.wait ())

let () =
  Lwt_main.run program

and then you create a simple server with:

nc -l 2001

and then let's start up the OCaml code with utop example.ml

and then open up a client

nc localhost 2000
blah blah
^c

Then looking at the connections for port 2000 using lsof, we see

ocamlrun 71109 Edgar    6u  IPv4 0x7ff3e309cb80aead      0t0  TCP 127.0.0.1:callbook (LISTEN)
ocamlrun 71109 Edgar    7u  IPv4 0x7ff3e309c9dc8ead      0t0  TCP 127.0.0.1:callbook->127.0.0.1:54872 (CLOSE_WAIT)

In fact for each usage of nc localhost 2000, we'll get a leftover CLOSE_WAIT record from the lsof usage.

Eventually the system runs out of file descriptors, which, MOST annoyingly, does not crash the program but instead causes Lwt to just hang.

I can't tell whether I am doing something wrong or whether this is a genuine bug. Either way, it is a serious problem for me: I run out of file descriptors within 10 hours...

EDIT: It seems to me that the problem is that one side of the connection is closed but the other isn't. I would have thought that with_connection should clean up and close everything whenever either side closes, i.e. whenever nc_ic or nc_oc closes.

EDIT II: I have tried every way of manually closing the descriptors with Lwt_io.close, but the connections still end up in CLOSE_WAIT.

EDIT III: I even used Lwt_unix.close on a raw fd passed via with_connection's optional fd argument, with similarly bad results.

EDIT IV: Most insidious of all: if I use Lwt_daemon.daemonize, the problem seemingly goes away.

Stas
  • I grepped your code and didn't find any calls to `close`. I'm not surprised that fds are leaking – ivg Jan 12 '16 at 00:28
  • @ivg Why do I need to call close at all? Presumably that would be just for the server's ic and oc. Isn't the point of with_connection to handle this for me? –  Jan 12 '16 at 00:29
  • The docstring of `establish_server` says (about its function argument) "Note that [f] must not raise any exception.". I would look there first. – gsg Jan 12 '16 at 04:58
  • @gsg no exceptions are being raised. –  Jan 12 '16 at 05:00
  • How about calling `Lwt_io.shutdown_server`? – gsg Jan 12 '16 at 05:05
  • I don't need to shut down the server. The most psychotic thing is that this error goes away when using Lwt_daemon.daemonize. –  Jan 12 '16 at 05:10
  • @EdgarAroutiounian We had a similar problem a year ago, unfortunately I don't remember the details (CLOSE_WAIT, Lwt, running out of file descriptors). It went away as soon as we started watching the process with strace, which doubles our CPU usage. We've been using `sudo strace -t -p "$pid" -ff -e open,connect,accept,close,shutdown -o "$dir"/strace-$date-$pid.log`. – Martin Jambon Jan 12 '16 at 18:34
  • @MartinJambon incredibly unsatisfying solution –  Jan 12 '16 at 19:04
  • @EdgarAroutiounian I wouldn't post a solution in the comments section. strace is supposed to help you find the source of the leak. – Martin Jambon Jan 12 '16 at 21:52
  • @MartinJambon strace itself often just gets stuck on a read call on a socket. –  Jan 12 '16 at 21:53
  • I want to note that this has (I believe) been fixed in Lwt 3.0.0 (released today) with a new `Lwt_io.establish_server` that closes the sockets automatically, and has a slightly different type signature. If you want, I can make that into an answer. – antron Apr 19 '17 at 21:53
  • @antron Sure, I need to upgrade my code as well to reflect these awesome new changes, fixes. –  Apr 19 '17 at 22:05

2 Answers

First, it is not clear why you use join (<&>) instead of choose (<?>). I would expect the connection to be closed as soon as either side wants to close it.
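To illustrate the difference, here is a minimal sketch that uses only the core Lwt library, with no sockets. The helper name `resolved_after_first` is made up for this illustration. It creates two pending promises, resolves only the first, and checks whether the combined promise has resolved.

```ocaml
(* Requires the lwt library, e.g. in utop: #require "lwt" *)
open Lwt.Infix

(* Hypothetical helper: resolve only the first of two pending promises and
   report whether the combined promise is resolved afterwards. *)
let resolved_after_first combine =
  let a, wake_a = Lwt.wait () in
  let b, _wake_b = Lwt.wait () in
  let combined = combine a b in
  Lwt.wakeup wake_a ();
  Lwt.state combined <> Lwt.Sleep

let () =
  (* join waits for both promises, so it is still pending here *)
  Printf.printf "join   (<&>) resolved: %b\n" (resolved_after_first (<&>));
  (* choose resolves as soon as either promise does *)
  Printf.printf "choose (<?>) resolved: %b\n" (resolved_after_first (<?>))
```

In the proxy from the question, this means `echo tcp_ic nc_oc <&> echo nc_ic tcp_oc` keeps waiting until both directions have finished. Note that `<?>` does not cancel the promise that is still pending; `Lwt.pick` is the variant that does.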

Concerning CLOSE_WAIT: it is a half-closed connection from the utop server to the nc client.

A TCP connection consists of two half-connections, and they are closed independently. The half from the nc client to the utop server was closed by nc when you pressed Ctrl-C. But you have to explicitly close the opposite half on the server side by closing the output stream. I'm not sure why Lwt_io.establish_server doesn't close it automatically. Possibly this is a design issue.

This works for me on CentOS 7:

Lwt_io.printl "Created connection" >>= fun () ->
echo tcp_ic nc_oc <?> echo nc_ic tcp_oc >>= fun () ->
Lwt_io.close tcp_oc >>= fun () ->
Lwt_io.printl "finished"

Also, here is a simplified code snippet to reproduce the issue:

#require "lwt.unix"

let program =
  let server_address = Unix.(ADDR_INET (inet_addr_loopback, 2000)) in

  let _server = Lwt_io.establish_server server_address begin fun (ic, oc) ->
    (* Lwt_io.close oc |> Lwt.ignore_result; *) ()
  end
  in
  fst (Lwt.wait ())

let () =
  Lwt_main.run program

Run nc localhost 2000 several times to get connections in CLOSE_WAIT state. Uncomment the code to fix the issue.

Stas
  • Interesting, but this doesn't actually answer the question: it doesn't address the difference in behavior with daemonize, and it ignores the usage of with_connection. –  Jan 13 '16 at 07:32
  • @EdgarAroutiounian updated. `with_connection` is not the issue here. The problem is with `establish_server`. How do you use `daemonize`? – Stas Jan 13 '16 at 08:59

The underlying problem, at the time this question was asked, was that Lwt_io.establish_server did not make any effort at all to close the file descriptors associated with tcp_ic and tcp_oc. While this could (and should) have been addressed by users closing them manually, it was a weird and unexpected behavior.

The new Lwt_io.establish_server, available since Lwt 3.0.0, does try to close tcp_ic and tcp_oc automatically. To permit this, it has a slightly different type signature for the callback: the callback must return a promise, which you should resolve when tcp_ic/tcp_oc are not needed anymore. (EDIT) In practice, this means you just write your callback in natural Lwt style, and completion of the last Lwt operation will close the channels.

The new API also internally calls Lwt.async for running your callback, so you don't have to call that or Lwt.ignore_result.

You can still close the tcp_ic and tcp_oc manually in the callback, to write your own error handlers, which can be as elaborate as you please. The second automatic, internal close inside the new Lwt_io.establish_server won't have any harmful effect.
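As a rough sketch of the promise-returning callback style, the example below starts a one-line echo server, connects to it once, and shuts the server down. It rests on some assumptions: it uses `Lwt_io.establish_server_with_client_address`, the name under which later Lwt releases expose this style of server (its callback additionally receives the client address); the names `handle_client` and `roundtrip` are invented for illustration; and port 2000 is simply reused from the question.

```ocaml
(* Requires the lwt.unix library, e.g. in utop: #require "lwt.unix" *)
open Lwt.Infix

(* Hypothetical callback: echo one line back. No explicit close is needed;
   the server closes ic and oc once the returned promise resolves. *)
let handle_client _client_addr (ic, oc) =
  Lwt_io.read_line_opt ic >>= function
  | Some line -> Lwt_io.write_line oc line
  | None -> Lwt.return_unit

(* Start the server, send one line as a client, return the reply. *)
let roundtrip () =
  let addr = Unix.(ADDR_INET (inet_addr_loopback, 2000)) in
  Lwt_io.establish_server_with_client_address addr handle_client
  >>= fun server ->
  Lwt_io.with_connection addr (fun (ic, oc) ->
    Lwt_io.write_line oc "blah blah" >>= fun () ->
    Lwt_io.flush oc >>= fun () ->
    Lwt_io.read_line ic)
  >>= fun reply ->
  Lwt_io.shutdown_server server >>= fun () ->
  Lwt.return reply

let () =
  print_endline (Lwt_main.run (roundtrip ()))
```

The callback is written in ordinary Lwt style and never calls `Lwt_io.close` or `Lwt.ignore_result`. Adding an explicit `Lwt_io.close oc` at the end of `handle_client` would also be fine, for the reason given above.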

The new API was the eventual result of the parallel discussion of this question in Lwt issue #208.

If someone would like the old, painful behavior, perhaps to reproduce the issue in the question, the old API is available for a while longer under the name Lwt_io.Versioned.establish_server_1.

antron