
After a while in production, our WildFly 18 server started throwing this error:

[org.xnio.listener] (default I/O-1) XNIO001007: A channel event listener threw an exception: 
java.lang.OutOfMemoryError: Direct buffer memory
    at java.base/java.nio.Bits.reserveMemory(Bits.java:175)
    at java.base/java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:118)
    at java.base/java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:317)
    at org.jboss.xnio@3.7.3.Final//org.xnio.BufferAllocator$2.allocate(BufferAllocator.java:57)
    at org.jboss.xnio@3.7.3.Final//org.xnio.BufferAllocator$2.allocate(BufferAllocator.java:55)
    at org.jboss.xnio@3.7.3.Final//org.xnio.ByteBufferSlicePool.allocateSlices(ByteBufferSlicePool.java:162)
    at org.jboss.xnio@3.7.3.Final//org.xnio.ByteBufferSlicePool.allocate(ByteBufferSlicePool.java:149)
    at io.undertow.core@2.0.27.Final//io.undertow.server.XnioByteBufferPool.allocate(XnioByteBufferPool.java:53)
    at io.undertow.core@2.0.27.Final//io.undertow.server.protocol.framed.AbstractFramedChannel.allocateReferenceCountedBuffer(AbstractFramedChannel.java:549)
    at io.undertow.core@2.0.27.Final//io.undertow.server.protocol.framed.AbstractFramedChannel.receive(AbstractFramedChannel.java:370)
    at io.undertow.core@2.0.27.Final//io.undertow.websockets.core.AbstractReceiveListener.handleEvent(AbstractReceiveListener.java:38)
    at io.undertow.core@2.0.27.Final//io.undertow.websockets.core.AbstractReceiveListener.handleEvent(AbstractReceiveListener.java:33)
    at org.jboss.xnio@3.7.3.Final//org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:92)
    at io.undertow.core@2.0.27.Final//io.undertow.server.protocol.framed.AbstractFramedChannel$FrameReadListener.handleEvent(AbstractFramedChannel.java:950)
    at io.undertow.core@2.0.27.Final//io.undertow.server.protocol.framed.AbstractFramedChannel$FrameReadListener.handleEvent(AbstractFramedChannel.java:931)
    at org.jboss.xnio@3.7.3.Final//org.xnio.ChannelListeners.invokeChannelListener(ChannelListeners.java:92)
    at org.jboss.xnio@3.7.3.Final//org.xnio.conduits.ReadReadyHandler$ChannelListenerHandler.readReady(ReadReadyHandler.java:66)
    at org.jboss.xnio.nio@3.7.3.Final//org.xnio.nio.NioSocketConduit.handleReady(NioSocketConduit.java:89)
    at org.jboss.xnio.nio@3.7.3.Final//org.xnio.nio.WorkerThread.run(WorkerThread.java:591)

We analyzed a JVM heap dump with jxray, and it seems websockets are the culprit:

[jxray report showing a lot of DirectByteBuffer memory used by websockets]
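
For what it's worth, direct buffers are allocated outside the heap and capped by -XX:MaxDirectMemorySize (by default roughly the maximum heap size), and the JDK exposes their usage through the standard BufferPoolMXBean. Here is a minimal sketch of how that pool can be watched live; the class name and output format are only illustrative, and the same loop could sit in any scheduled task:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectBufferStats
{
    public static void main(String[] args)
    {
        // The JDK registers one BufferPoolMXBean per pool ("direct" and "mapped").
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
        {
            if ("direct".equals(pool.getName()))
            {
                // Number of live direct buffers and the native memory they currently hold.
                System.out.printf("direct buffers: count=%d, used=%d bytes, capacity=%d bytes%n",
                        pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }
}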

The thing is, our websocket endpoint is fairly straightforward:

@ApplicationScoped
@ServerEndpoint(value = "/ws/messenger/{accountId}")
public class MessengerSocket implements Serializable
{
    private static final long serialVersionUID = -3173234888004281582L;

    @Inject
    private Logger log;
    @Inject
    private MessengerHandler handler;

    @OnOpen
    public void onOpen(@PathParam("accountId") String accountId, Session session, EndpointConfig config)
    {
        log.debug("Opening for {}", accountId);
        handler.subscribeSocket(session, UUID.fromString(accountId));
    }

    @OnClose
    public void onClose(@PathParam("accountId") String accountId, Session session, CloseReason closeReason)
    {
        log.debug("Closing {}", accountId);
        handler.unsubscribeSocket(session, UUID.fromString(accountId));
    }
}

It's coupled with a simple handler, which manages a map of user sessions:

@ApplicationScoped
public class MessengerHandler
{
    @Inject
    private Logger log;

    // key: Account id
    private Map<UUID, AccountMessengerSessions> sessions;

    @PostConstruct
    public void init()
    {
        sessions = new ConcurrentHashMap<>();
    }

    public void subscribeSocket(Session session, UUID accountId)
    {
        // build and store the account messenger session if new
        AccountMessengerSessions messenger = sessions.getOrDefault(accountId, new AccountMessengerSessions(accountId));
        messenger.getWsSessions().add(session);
        sessions.putIfAbsent(accountId, messenger);
        log.debug("{} has {} messenger socket session(s) (one added)", messenger.getAccountId(), messenger.getWsSessions().size());
    }

    /**
     * Unsubscribes the provided WebSocket from the Messenger.
     */
    public void unsubscribeSocket(Session session, UUID accountId)
    {
        if (!sessions.containsKey(accountId))
        {
            log.warn("Ignore unsubscription from {} socket, as {} is unknwon from messenger", session.getId(), accountId);
            return;
        }
        AccountMessengerSessions messenger = sessions.get(accountId);
        messenger.getWsSessions().remove(session);
        log.debug("{} has {} messenger socket session(s) (one removed)", messenger.getAccountId(), messenger.getWsSessions().size());
        if (!messenger.getWsSessions().isEmpty())
        {
            return;
        }
        // no more socket sessions, fully remove
        sessions.remove(messenger.getAccountId());
    }
}
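
A simple way to rule out sessions silently piling up in that map is to log its size on a schedule. Below is a minimal sketch, assuming an EJB timer is available and a hypothetical getAccountCount() accessor (returning sessions.size()) is added to MessengerHandler; the one-minute interval is arbitrary:

import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.inject.Inject;
import org.slf4j.Logger; // assuming the same Logger producer used elsewhere

@Singleton
public class MessengerSessionMonitor
{
    @Inject
    private Logger log;
    @Inject
    private MessengerHandler handler;

    // Logs how many accounts still hold at least one websocket session.
    // getAccountCount() is a hypothetical accessor returning sessions.size().
    @Schedule(hour = "*", minute = "*", persistent = false)
    public void logSessionCount()
    {
        log.info("Messenger currently tracks {} account(s)", handler.getAccountCount());
    }
}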

On the client side, we have a bit of JavaScript that runs when a page is loaded; again, nothing fancy:

var accountId = // some string found in DOM
var websocketUrl = "wss://" + window.location.host + "/ws/messenger/" + accountId;
var websocket = new WebSocket(websocketUrl);
websocket.onmessage = function (event) {
  var data = JSON.parse(event.data);
  // nothing fancy here...
};

Our users don't use the feature offered by the websocket (an instant messenger) very much, so what really happens in production is websockets opening and closing on each page load, with very few messages sent through them.

Where could we get it wrong and create this buffer leak? Did we forget something critical?

Xavier Portebois
  • Are you sure unsubscribe/onClose is called without any issues for all WebSocket closure? Are you able to recreate this issue, if yes what are the steps to do so? I have two initial suspects 1. Unsubscribe is not called so sessions are accumulated. 2. On unsubscribe, you are not leaving some residue – Vinay KV Aug 24 '20 at 06:58
  • As mentioned by @VinayKV, are you sure that your sessions map is correctly being maintained and you aren't just building up more and more sockets? A 'sessioncount' metric to measure this would be useful. Also, this thread has some detail about older version of code causing onClose not to execute, might be something worth investigating - https://stackoverflow.com/a/45434399/1538039 – Dylan Morley Aug 24 '20 at 07:03
  • The onClose() and unsubscribe() methods are called, yes, and the logs end by signaling that each user has no more open sessions ("user 123 has 0 messenger socket session(s) (one removed)"). – Xavier Portebois Aug 24 '20 at 07:08
  • Are you able to reproduce this issue? If yes, monitor the memory to see where the leak is when unsubscribe is called. looks like some residue when socket session is unsubscribed – Vinay KV Aug 24 '20 at 07:34
  • Sorry, forgot that part. No, I could not reproduce the issue. I tried with a Gatling scenario generating a lot of traffic, manually opening the websockets (as Gatling does not handle javascript), and the leak did not occur. – Xavier Portebois Aug 24 '20 at 08:11
  • From the code I don't find anything suspicious. Could you retest on an other WildFly version? Also, you might want to consider filing a JIRA issue for this. Personally I've found a memory leak in EJB a couple of WildFly versions ago, so mistakes do happen. – Jan-Willem Gmelig Meyling Aug 25 '20 at 11:04
  • All buffer handling is done inside Undertow. You have no access to allocation and release of resources, right? My guess is that the problem is on the client side. Try a couple of clients. One client lives a long life and sends tons of messages. The second client tries to send a 2 Gb video over websocket. My bet is that it is the second type of clients that might overload your system. Could you limit the size of the message on the client side? – Pak Uula Aug 26 '20 at 02:01
  • To be fair, the feature is barely used (we have removed it for now to avoid OoM in the production environment), and users can only send simple text messages. Nevertheless, as Gatling could not reproduce the issue, I suspect some clients handle their websockets in a curious way. – Xavier Portebois Aug 26 '20 at 06:34
  • @XavierPortebois, any update on this? We are facing the same issue. – Victor Bello Jun 02 '21 at 22:49
  • @VictorBello Alas no, we had to disable websockets :-/ – Xavier Portebois Jun 03 '21 at 08:56
  • @XavierPortebois We have exactly the same problem. What JVM are you using? We recently updated to WildFly 18 and Java 11; I think this could be the reason in our case... How did you disable your websockets? Remove the `` entry from standalone.xml? We do not even use websockets ... – Lonzak Oct 29 '21 at 10:09

2 Answers


Looking at this post, this may happen if you have a lot of CPUs. It was solved by decreasing the number of I/O workers. Not sure if this can help in your case.
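
For reference, the I/O thread count is configured on the default worker of WildFly's io subsystem. A minimal sketch with the JBoss CLI, where the value 2 is only an example (by default it is derived from the number of CPUs):

# Lower the number of XNIO I/O threads on the default worker (example value).
/subsystem=io/worker=default:write-attribute(name=io-threads, value=2)
# Reload the server so the change takes effect.
reload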

Cyril G.

I also had a similar problem on our WildFly 18 (WildFly 19 suffers from it, too). It is probably triggered by a faulty XNIO lib inside WildFly. After updating to WildFly 22 (which uses the latest XNIO lib), the problem was gone.

Lonzak