
I tested this with Log4j2 2.14.0 and 2.13.3. I used the JDBC Appender in combination with the DynamicThresholdFilter and tried both a normal Logger and the AsyncLogger.

For the JDBC Appender I also tried both the PoolingDriver and the ConnectionFactory approach.

It turns out that the threads do not run in parallel because of Log4j2.

Using the AsyncLogger made it even worse: the output said that the appender is not started, and of 15,000 expected log entries only 13,517 ended up in the DB.

To reproduce the issue I made a GitHub repo, see here: https://github.com/stefanwendelmann/Log4j_JDBC_Test

[GIF: thread activity during the test]

EDIT

I replaced the mssql-jdbc driver with an H2 DB and the threads don't block. The JMC automated analysis says that there are locking instances of JdbcDatabaseManager.

Is there a configuration problem in my PoolableConnectionFactory for mssql-jdbc, or is there a general problem with DBCP / JDBC driver pooling?

Edit 2

Created a ticket in Apache's Log4j2 Jira: https://issues.apache.org/jira/browse/LOG4J2-3022

Edit 3

Added longer flight recordings for MSSQL and H2 (file mode):

https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_docker_mssql_asynclogger_10000_runs.jfr

https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_h2_file_asynclogger_10000_runs.jfr

  • Are you referring to `pool-2-thread-{1-8}` in your GIF, where I see that those 8 threads are basically taking turns, running one at a time? And what about the ~1,500 lost logs in asynchronous mode? Are they lost due to the queue limit being reached, or maybe because your test JVM terminates before all logs could be written to the database? I have not cloned your repo or studied your test yet and am kind of reluctant to install MS SQL just for that. Some up front information would be nice or maybe a sample configuration for a lightweight RDBMS (H2? MySQL?) in your GitHub repo. – kriegaex Feb 14 '21 at 06:24
  • Capture a flight recording and share here. **A lot** goes into thread scheduling, even beyond your code or the library; the number of available processors, what the processor is spending its time on, if the threads are blocking. This is largely unanswerable using just what you have here. – kolossus Feb 14 '21 at 17:54
  • @kolossus, to be fair, the OP did share a full GitHub project. If you don't mind to install SQL Server or modify the code to use another RDBMS, I think you can see for yourself. I do not want to install a multi-GB product like SQL Server, so if Stefan would adjust the sample project to a more lightweight RDBMS, I would certainly take a look. – kriegaex Feb 15 '21 at 02:47
  • @kriegaex fair enough. For the same reason as you've stated, it's too large a time commitment and frankly, I doubt the results would be identical to what OP's experiencing. Better to record exactly what's happening on their machine and upload it for another set of eyes to review – kolossus Feb 15 '21 at 05:49
  • Hi @kolossus, I added system info https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/README.md#testsystem and a flight recording https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording.jfr – Stefan Höltker Feb 15 '21 at 08:45
  • I tried out the H2 DB and it's much, much faster compared to a standard SQL Server 2017 with the 8.4.1.jre8 mssql-jdbc driver: the run takes around 5 min against MSSQL and only around 5 sec with H2. I tried out various MS SQL Servers with different JDBC drivers, but it's always slow. Is the PoolableConnectionFactory set up correctly? – Stefan Höltker Feb 15 '21 at 13:08
  • @kriegaex I added a file-based H2 DB. I think it's a config problem with MSSQL Server? – Stefan Höltker Feb 15 '21 at 13:18
  • I am on the road now, couldn't take a look yet. Thanks though. What makes you say it is an SQL Server config problem? Does it go away when using H2? **Edit:** OK, just saw your question update. So you gained some new insight already by trying another RDBMS, which is good. – kriegaex Feb 15 '21 at 15:29
  • Have you experimented with [`AsyncAppender`](https://logging.apache.org/log4j/log4j-2.3/manual/appenders.html#AsyncAppender) instead of or in combination with `AsyncLogger`? – kriegaex Feb 16 '21 at 06:08
  • @kriegaex yes, I added the results to the repo. The AsyncAppender alone is comparable to running without it; both in combination made it worse, see the .log file. – Stefan Höltker Feb 16 '21 at 08:06
  • Thanks for the recording @StefanHöltker; I'll give it a look today. – kolossus Feb 16 '21 at 15:57
  • @StefanHöltker is there a connection pool in your setup? I'm having a look at your flight recording right now; there are just two thread dumps in it, and only one of the dumps has a single relevant stacktrace. Multiple thread dumps will become necessary to prove the point of contention. So far, all the methods in the relevant stacktrace point to an unholy chain of `synchronize`d methods. An u – kolossus Feb 16 '21 at 23:12
  • @kolossus yes, the pooling should be handled in the log4j2.xml via the PoolingDriver -> PoolableConnectionFactory – Stefan Höltker Feb 17 '21 at 07:23
  • I'm afraid more relevant thread dumps will be necessary to observe what each thread is waiting for when it blocks. If you get the chance to run another flight recording, make sure that it runs for long enough, so that it has multiple thread dumps in it featuring all the pooled connection threads. The current recording has just one dump in it, with just a single pooled thread @StefanHöltker – kolossus Feb 17 '21 at 13:33
  • @kolossus I will make a longer flight recording tomorrow, upload it and inform you :) – Stefan Höltker Feb 17 '21 at 16:11
  • @StefanHöltker, maybe you want to put the alternative DBMS logging options such as connection string into the Log4J config file as comments instead of forcing someone working on this question or the related Log4J ticket to sift through the commit history in order to see what you did. You better make it easy for the Log4J folks to compare the two situations MS SQL vs. H2. Also, in the ticket description there you say "H2 in memory", i.e. comparing in memory to something involving I/O, a bit like apples vs. oranges. Maybe you want to explain that you also used H2 in write-to-file mode. – kriegaex Feb 18 '21 at 01:21
  • @kriegaex thanks for the info, I updated the ticket and the repo – Stefan Höltker Feb 18 '21 at 13:54
  • @kolossus I added the longer flight recordings, see Edit 3 – Stefan Höltker Feb 18 '21 at 13:54

2 Answers


Thanks for getting the flight recordings up. This is a pretty interesting scenario, but I'm afraid I can't give conclusive answers, mostly because, for some reason:

  1. The information in your flight recordings is weirdly incomplete. I'll explain a little more shortly
  2. There seem to be other things going on in your system that may be muddying the diagnosis. You might benefit from killing any other running processes on your machine.

So, what now? (TL;DR)

  1. You need to be sure that your connection source to the database is pooled
  2. Make sure you start your load test on a calm, clear-headed CPU
  3. Configure your next flight recording to take sufficient, intermittent thread dumps (see the jcmd sketch after this list). This is probably the most important next step if you're interested in figuring out what exactly all these threads are waiting for. Don't post up another flight recording until you're positive it contains multiple thread dumps that feature all the live threads in your JVM.
  4. Maybe 10k threads isn't reasonable for your local machine
  5. I also noticed from the flight recording that you have a heap size maxed at 7 GB. If you're not on a 64-bit OS, that could actually be harmful: a 32-bit OS can address a max of 4 GB.
  6. Make sure there aren't any actual database failures causing the whole thing to thrash. Are you running out of connections? Are there any SQLExceptions blowing up somewhere? Any exceptions at all?
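Regarding point 3, here is a minimal sketch of how such a recording could be captured with jcmd (assuming a JDK 11+ JVM; `<pid>` is the test JVM's process ID, and the `profile` settings template already includes periodic thread dumps):

    # Start a recording with the "profile" template and keep it running for the whole load test
    jcmd <pid> JFR.start name=jdbc-block settings=profile filename=recording_long.jfr

    # Optionally grab explicit thread dumps while the pooled threads are blocked,
    # so the owner of the contended lock can be identified outside the recording too
    jcmd <pid> Thread.print > dump_1.txt
    jcmd <pid> Thread.print > dump_2.txt

    # Stop and flush the recording once the test run is done
    jcmd <pid> JFR.stop name=jdbc-block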

Here's what I could tell from your recordings:

CPU

Both flight recordings show that your CPU was struggling for good chunks of the run:

The MSSQL recording (46 mins total)

[JFR screenshot: CPU load chart for the MSSQL recording]

JFR even warns in the MSSQL recording that:

An average CPU load of 42 % was caused by other processes during 1 min 17 s starting at 2/18/21 7:28:58 AM.

The H2 recording (20.3 s total):

[JFR screenshot: CPU load chart for the H2 recording]

I noticed that your flight recordings are titled XXXX_10000. If this means "10k concurrent requests", it may simply be that your machine can't deal with the load you're putting on it. You may also benefit from first ensuring that your cores don't have a bunch of other things hogging their time before you kick off another test. At any rate, hitting 100% CPU utilization is bound to cause lock contention as a matter of course, due to context switching. Your flight recording shows that you're running on an 8-core machine, but you noted that you're running a dockerized MSSQL. How many cores did you allocate to Docker?
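If Docker Desktop or a similar setup is in the mix, it may be worth checking and pinning the container's CPU budget before the next run. A sketch; the image tag, port, and password are placeholder assumptions about your setup:

    # How many CPUs the Docker daemon can actually use
    docker info --format '{{.NCPU}}'

    # Example: cap the SQL Server container at 4 CPUs so it cannot starve the test JVM
    docker run --cpus="4" -p 1433:1433 -e ACCEPT_EULA=Y -e "SA_PASSWORD=<password>" \
        mcr.microsoft.com/mssql/server:2017-latest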

Blocked Threads

There's a tonne of blocking in your setup, and there are smoking guns everywhere. The thread identified by Log4j2-TF-1-AsyncLoggerConfig-1 was blocked a lot by the garbage collector, just as the CPU was thrashing:

The H2 flight recording:

[JFR screenshot: GC blocking of Log4j2-TF-1-AsyncLoggerConfig-1 in the H2 recording]

All but the last 3 ticks across that graph are the log4j2 thread being blocked. There was still significant blocking of the other pooled threads by GC (more on that further down).

The MSSQL flight recording had smoother GC, but both recordings featured blocking by GC and the consequent super-high CPU utilization. One thing was clear from both the MSSQL and the H2 recording: every other pooled thread was blocked, waiting for a lock on the same object ID.

For MSSQL, lock ID 0x1304EA4F40; for H2, lock ID 0x21A1100D7D0.

Every thread except the main thread and pool-1-thread-1 (which was blocked by garbage collection) exhibits this behavior.

These 7 threads are all waiting for the same object. There is definitely some blocking or even a deadlock somewhere in your setup.

The small specks of green also corroborate the intermittent transfer of monitor locks between the various threads, confirming that they're sort of deadlocked. The pane that shows the threads at the bottom gives a timeline of each thread's blockage. Red indicates blocked; green indicates running. If you hover over each thread's red portion, it shows you that

  1. The thread is blocked, waiting to acquire a lock (RED)
  2. The ID of the lock that the thread is trying, and currently failing, to acquire
  3. The ID of the thread that last held the lock
  4. Green indicates a running, unblocked thread.

When you hover over the red slices in your flight recording, you'll see that they're all waiting to acquire the same lock. That lock is intermittently held between the various pooled threads.

MSSQL (threads blocked waiting for 0x1304EA4F40):

[JFR screenshot: MSSQL threads pane, all pooled threads blocked on 0x1304EA4F40]

H2 (threads blocked waiting for 0x21A1100D7D0):

[JFR screenshot: H2 threads pane, all pooled threads blocked on 0x21A1100D7D0]

In both flight recordings, pool-1-thread-1 is the sole thread that isn't blocked while trying to acquire a lock. That blank row for pool-1-thread-1 is solely due to garbage collection, which I covered earlier.

Dumps

Ideally, your flight recordings should contain a bunch of thread dumps, especially the one that ran for over 40 mins; never mind the 20 s one. Unfortunately, both recordings contain just 2 thread dumps each, and only one dump in each even contains the stacktrace for pool-1-thread-1. Singular thread dumps are worthless; you need multiple snapshots over a length of time to make use of them. With a thread dump (or a heap dump), one could identify what objects the IDs 0x1304EA4F40 and 0x21A1100D7D0 refer to. The most I could figure out from the dumps is that they're all waiting for an instance of "Object":

[JFR screenshot: blocked threads waiting on an instance of Object]

It literally could be anything. Your very first flight recording at least showed that the threads were locked on org.apache.logging.log4j.core.appender.db.jdbc.JdbcDatabaseManager:

[JFR screenshot: threads locked on JdbcDatabaseManager]

That very first recording shows the same pattern in the locks pane, that all the threads were waiting for that single object:

[JFR screenshot: locks pane, all threads waiting for the same object]

That first recording also shows us what pool-1-thread-1 was up to at that one instant:

[JFR screenshot: stacktrace of pool-1-thread-1]

From there, I would hazard a guess that that thread was in the middle of closing a database connection. Nothing conclusive can be said until multiple successive thread dumps show the thread activity over a span of time.

kolossus

I tested with a MySQL DB and found a lock on the following method:

org.apache.logging.log4j.core.appender.db.AbstractDatabaseManager.write(org.apache.logging.log4j.core.LogEvent, java.io.Serializable) (line: 261)

because in the source code you can see synchronization on the write method:

    /**
     * This method manages buffering and writing of events.
     *
     * @param event The event to write to the database.
     * @param serializable Serializable event
     */
    public final synchronized void write(final LogEvent event, final Serializable serializable) {
        if (isBuffered()) {
            buffer(event);
        } else {
            writeThrough(event, serializable);
        }
    }

I think if you specify a buffer size it will increase throughput, because log events will be collected into batches and the synchronized write will be entered far less often.
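For illustration, a rough sketch of what that could look like in the log4j2.xml (the appender name `Lauflog` comes from the error output quoted in the comments; the table/column names and the buffer size of 100 are made-up placeholders, and the existing PoolingDriver/ConnectionFactory connection source would stay as it is):

    <!-- Sketch: bufferSize > 0 makes the appender collect events and flush them in
         batches, so the synchronized write() is hit once per flush, not once per event -->
    <JDBC name="Lauflog" tableName="LAUFLOG" bufferSize="100">
        <!-- keep the existing PoolingDriver / ConnectionFactory connection source here -->
        <Column name="EVENT_DATE" isEventTimestamp="true"/>
        <Column name="LEVEL" pattern="%level"/>
        <Column name="MESSAGE" pattern="%message"/>
    </JDBC>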

After updating the log4j2 config file to use the AsyncLogger, you will see a lock on:

org.apache.logging.log4j.core.async.AsyncLoggerConfigDisruptor.enqueue(org.apache.logging.log4j.core.LogEvent, org.apache.logging.log4j.core.async.AsyncLoggerConfig) (line: 375)

and the implementation of that method:

    private void enqueue(final LogEvent logEvent, final AsyncLoggerConfig asyncLoggerConfig) {
        if (synchronizeEnqueueWhenQueueFull()) {
            synchronized (queueFullEnqueueLock) {
                disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
            }
        } else {
            disruptor.getRingBuffer().publishEvent(translator, logEvent, asyncLoggerConfig);
        }
    }

synchronizeEnqueueWhenQueueFull is true by default, and it produces locking across threads when the ring buffer is full. You can manage these parameters:

    /**
     * LOG4J2-2606: Users encountered excessive CPU utilization with Disruptor v3.4.2 when the application
     * was logging more than the underlying appender could keep up with and the ringbuffer became full,
     * especially when the number of application threads vastly outnumbered the number of cores.
     * CPU utilization is significantly reduced by restricting access to the enqueue operation.
     */
    static final boolean ASYNC_LOGGER_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
            .getBooleanProperty("AsyncLogger.SynchronizeEnqueueWhenQueueFull", true);
    static final boolean ASYNC_CONFIG_SYNCHRONIZE_ENQUEUE_WHEN_QUEUE_FULL = PropertiesUtil.getProperties()
            .getBooleanProperty("AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull", true);

But you should know about the side effect of disabling these parameters (excessive CPU utilization when the ring buffer is full), as mentioned in the code comment above.
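If you do want to experiment with them, they can be set as JVM system properties or in a `log4j2.component.properties` file on the classpath (a sketch; the property names are exactly the ones read in the snippet above):

    # log4j2.component.properties
    # Trade-off: higher CPU load when the ring buffer is full, which is what LOG4J2-2606 was about
    AsyncLogger.SynchronizeEnqueueWhenQueueFull=false
    AsyncLoggerConfig.SynchronizeEnqueueWhenQueueFull=false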

Ideas why a DB can become a bottleneck:

  • remoteness of the DB (VPN etc.)
  • check which strategy is used for the ID column (SEQUENCE, TABLE, IDENTITY) to avoid an additional DB call
  • are there indexes on the columns? (they can trigger index maintenance on each transaction commit)
saver
  • Thanks for your answer. But why is it so much faster with the H2 DB? I am running a Docker host on my laptop with MSSQL Server on it. Switching to MSSQL on a remote server, the logging is as slow as locally. I tried out "newsequentialid()" and identity, both slow. No other indexes, as shown in the example schema_mssql.sql. Maybe the JDBC driver or DBCP? I will test it with PostgreSQL tomorrow and then open a ticket in Log4j's Jira. – Stefan Höltker Feb 15 '21 at 20:50
  • Using the AsyncLogger I get errors and not all logs are in the DB, so they are lost. The error is Log4j2-TF-1-AsyncLoggerConfig-1 ERROR Attempted to append to non-started appender Lauflog, so I assume the appender is closed by another thread or something? – Stefan Höltker Feb 16 '21 at 09:42
  • PostgreSQL is a bit faster, but the JdbcDatabaseManager is also blocking – Stefan Höltker Feb 16 '21 at 09:43
  • Using an AsyncLogger with "-Dlog4j2.asyncLoggerConfigRingBufferSize=524288", the program seems to exit before everything is written to the DB: 2021-02-16 13:18:09,809 pool-1-thread-1 WARN AsyncLoggerConfigDisruptor: shutdown timed out after 0 MILLISECONDS 2021-02-16 13:18:09,822 Log4j2-TF-1-AsyncLoggerConfig-1 ERROR Attempted to append to non-started appender Lauflog – Stefan Höltker Feb 16 '21 at 12:32
  • Have you tried something as simple as waiting for the asynchronous logs to be written in your application (or test) before shutting it down? I mean `LogManager.shutdown()`. Chances are, then you would no longer lose log entries in the DB and see strange errors like the one above. The errors seem to hint at race conditions between application and logging threads. If first you `Thread.join()` on logging application threads and then shut down the log manager, in theory you should be fine. If this helps, let me know and I can convert it into an answer. – kriegaex Feb 17 '21 at 03:05
  • Hi @kriegaex, it gets even worse. I got errors with "java.sql.SQLException: Pool not registered: example". All threads are finished and the LogManager shutdown interrupts the writing to the DB: "AsyncLoggerConfigDisruptor: shutdown timed out after 50 MILLISECONDS". And overall the main problem is not solved; I expected there to be something like DBCP behind the JDBC logger, but it seems there is only one thread writing to the DB. – Stefan Höltker Feb 17 '21 at 07:36