Why is reading a JDBC ResultSet by position faster than by name and how much faster?

Question

Announcing Hibernate 6 the Hibernate team claims that by switching from read-by-name to read-by-position in JDBC ResultSet they gain a performance benefit.

High-load performance testing showed that Hibernate’s approach of reading values from ResultSet by name to be its most limiting factor in scaling through-put.

Does that mean they are changing calls from getString(String columnLabel) to getString(int columnIndex)?

Why is this faster?

As ResultSet is an interface doesn't performance gain depend on the JDBC driver implementing it?

How big are the gains?

Mark Rotteveel · Accepted Answer · 2019-04-05T09:39:43.133

Speaking as a JDBC driver maintainer (and, I admit, making some sweeping generalizations which not necessarily apply to all JDBC driver), row values will usually be stored in an array or list because that most naturally matches the way the data is received from the database server.

As a result, retrieving values by index will be the simplest. It might be as simple as something like (ignoring some of the nastier details of implementing a JDBC driver):

public Object getObject(int index) throws SQLException {
    checkValidRow();
    checkValidIndex(index);
    return currentRow[index - 1];
}

This is about as fast as it gets.

On the other hand, looking up by column name is more work. Column names need to be treated case-insensitive, which has additional cost whether you normalize using lower or uppercase, or use a case-insensitive lookup using a TreeMap.

A simple implementation might be something like:

public Object getObject(String columnLabel) throws SQLException {
    return getObject(getIndexByLabel(columnLabel));
}

private int getIndexByLabel(String columnLabel) {
    Map<String, Integer> indexMap = createOrGetIndexMap();
    Integer columnIndex = indexMap.get(columnLabel.toLowerCase());
    if (columnIndex == null) {
        throw new SQLException("Column label " + columnLabel + " does not exist in the result set");
    }
    return columnIndex;
}

private Map<String, Integer> createOrGetIndexMap() throws SQLException {
    if (this.indexMap != null) {
        return this.indexMap;
    }
    ResultSetMetaData rsmd = getMetaData();
    Map<String, Integer> map = new HashMap<>(rsmd.getColumnCount());
    // reverse loop to ensure first occurrence of a column label is retained
    for (int idx = rsmd.getColumnCount(); idx > 0; idx--) {
        String label = rsmd.getColumnLabel(idx).toLowerCase();
        map.put(label, idx);
    }
    return this.indexMap = map;
}

Depending on the API of the database and available statement metadata, it may require additional processing to determine the actual column labels of a query. Depending on the cost, this will likely only be determined when it is actually needed (when accessing column labels by name, or when retrieving result set metadata). In other words, the cost of createOrGetIndexMap() might be pretty high.

But even if that cost is negligible (eg the statement prepare metadata from the database server includes the column labels), the overhead of mapping the column label to index and then retrieving by index is obviously higher than directly retrieving by index.

Drivers could even just loop over the result set metadata each time and use the first whose label matches; this might be cheaper than building and accessing the hash map for result sets with a small number of columns, but the cost is still higher than direct access by index.

As I said, this is a sweeping generalization, but I would be surprised if this (lookup index by name, then retrieve by index) isn't how it works in the majority of JDBC drivers, which means that I expect that lookup by index will generally be quicker.

Taking a quick look at a number of drivers, this is the case for:

Firebird (Jaybird, disclosure: I maintain this driver)
MySQL (MySQL Connector/J)
PostgreSQL
Oracle
HSQLDB
SQL Server (Microsoft JDBC Driver for SQL Server)

I'm not aware of JDBC drivers where retrieval by column name would be equivalent in cost or even cheaper.

Great explanation. I wonder if using the *requested* column name as the key in the map instead of doing it upfront using the lowercase column name (thus constructing the map lazily, adding an entry each time a getXxx() method is called) wouldn't make it faster for big result sets: it would avoid having to lowercase the same keys again and again for each row of the resultset. — JB Nizet, Apr 05 '19 at 09:47
@JBNizet Possibly. That is what we actually [do in Jaybird](https://github.com/FirebirdSQL/jaybird/blob/master/src/main/org/firebirdsql/jdbc/AbstractResultSet.java#L630), but I didn't want to clutter things here :) — Mark Rotteveel, Apr 05 '19 at 09:52

score 11 · Answer 2 · answered May 05 '22 at 12:29

In the very early days of making jOOQ, I had considered both options, of accessing JDBC ResultSet values by index or by name. I chose accessing things by index for these reasons:

RDBMS support

Not all JDBC drivers actually support accessing columns by name. I forgot which ones didn't, and if they still don't, because I never touched that part of JDBC's API again in 13 years. But some didn't and that was already a show stopper for me.

Semantics of the name

Furthermore, among those that do support column names, there are different semantics to a column name, mainly two, what JDBC calls:

The column name as in ResultSetMetaData::getColumnName
The column label as in ResultSetMetaData::getColumnLabel

There is a lot of ambiguity with respect to implementations of the above two, although I think the intent is quite clear:

The column name is supposed to produce the name of the column irrespective of aliasing, e.g. TITLE if the projected expression is BOOK.TITLE AS X
The column label is supposed to produce the label (or alias) of the column, or the name if no alias is available, e.g. X if the projected expression is BOOK.TITLE AS X

So, this ambiguity of what a name/label is is already very confusing and concerning. It doesn't seem something an ORM should rely on in general, although, in Hibernate's case, one can argue that Hibernate is in control of most SQL being generated, at least the SQL that is produced to fetch entities. But if a user writes an HQL or native SQL query, I would be reluctant to rely on the name/label - at least without looking things up in ResultSetMetaData, first.

Ambiguities

In SQL, it's perfectly fine to have ambiguous column names at the top level, e.g.:

SELECT id, id, not_the_id AS id
FROM book

This is perfectly valid SQL. You can't nest this query as a derived table, where ambiguities aren't allowed, but in top level SELECT you can. Now, what are you going to do with those duplicate ID labels at the top level? You can't know for sure which one you'll get when accessing things by name. The first two may be identical, but the third one is very different.

The only way to clearly distinguish between the columns is by index, which is unique: 1, 2, 3.

Performance

I had also tried performance at the time. I don't have the benchmark results anymore, but it's easy to write another benchmark quickly. In the below benchmark, I'm running a simple query on an H2 in-memory instance, and consume the ResultSet accessing things:

By index
By name

The results are staggering:

Benchmark                            Mode  Cnt        Score       Error  Units
JDBCResultSetBenchmark.indexAccess  thrpt    7  1130734.076 ±  9035.404  ops/s
JDBCResultSetBenchmark.nameAccess   thrpt    7   600540.553 ± 13217.954  ops/s

Despite the benchmark running an entire query on each invocation, the access by index is almost twice as fast! You can look at H2's code, it's open source. It does this (version 2.1.212):

private int getColumnIndex(String columnLabel) {
    checkClosed();
    if (columnLabel == null) {
        throw DbException.getInvalidValueException("columnLabel", null);
    }
    if (columnCount >= 3) {
        // use a hash table if more than 2 columns
        if (columnLabelMap == null) {
            HashMap<String, Integer> map = new HashMap<>();
            // [ ... ]

            columnLabelMap = map;
            if (preparedStatement != null) {
                preparedStatement.setCachedColumnLabelMap(columnLabelMap);
            }
        }
        Integer index = columnLabelMap.get(StringUtils.toUpperEnglish(columnLabel));
        if (index == null) {
            throw DbException.get(ErrorCode.COLUMN_NOT_FOUND_1, columnLabel);
        }
        return index + 1;
    }
    // [ ... ]

So. there's a hashmap with upper casing, and each lookup also performs upper casing. At least, it caches the map in the prepared statement, so:

You can reuse it on every row
You can reuse it on multiple executions of the statement (at least that's how I interpret the code)

So, for very large result sets, it might not matter as much anymore, but for small ones, it definitely does.

Conclusion for ORMs

An ORM like Hibernate or jOOQ is in control of a lot of SQL and the result set. It knows exactly what column is at what position, this work has already been done when generating the SQL query. So, there's absolutely no reason to rely on the column name any further when the result set comes back from the database server. Every value will be at the expected position.

Using column names must have been some historic thing in Hibernate. It's probably also why they used to generate these not so readable column aliases, to make sure that each alias is non-ambiguous.

It seems like an obvious improvement, irrespective of the actual gains in a real world (non-benchmark) query. Even if the improvement had been only 2%, it would have been worth it, because it affects every query execution by every Hibernate based application.

Benchmark code below, for reproduction

package org.jooq.test.benchmarks.local;

import java.io.*;
import java.sql.*;
import java.util.Properties;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.*;

@Fork(value = 1)
@Warmup(iterations = 3, time = 3)
@Measurement(iterations = 7, time = 3)
public class JDBCResultSetBenchmark {

    @State(Scope.Benchmark)
    public static class BenchmarkState {

        Connection connection;

        @Setup(Level.Trial)
        public void setup() throws Exception {
            try (InputStream is = BenchmarkState.class.getResourceAsStream("/config.properties")) {
                Properties p = new Properties();
                p.load(is);
                connection = DriverManager.getConnection(
                    p.getProperty("db.url"),
                    p.getProperty("db.username"),
                    p.getProperty("db.password")
                );
            }
        }

        @TearDown(Level.Trial)
        public void teardown() throws Exception {
            connection.close();
        }
    }

    @FunctionalInterface
    interface ThrowingConsumer<T> {
        void accept(T t) throws SQLException;
    }

    private void run(BenchmarkState state, ThrowingConsumer<ResultSet> c) throws SQLException {
        try (Statement s = state.connection.createStatement();
            ResultSet rs = s.executeQuery("select c as c1, c as c2, c as c3, c as c4 from system_range(1, 10) as t(c);")) {
            c.accept(rs);
        }
    }

    @Benchmark
    public void indexAccess(Blackhole blackhole, BenchmarkState state) throws SQLException {
        run(state, rs -> {
            while (rs.next()) {
                blackhole.consume(rs.getInt(1));
                blackhole.consume(rs.getInt(2));
                blackhole.consume(rs.getInt(3));
                blackhole.consume(rs.getInt(4));
            }
        });
    }

    @Benchmark
    public void nameAccess(Blackhole blackhole, BenchmarkState state) throws SQLException {
        run(state, rs -> {
            while (rs.next()) {
                blackhole.consume(rs.getInt("C1"));
                blackhole.consume(rs.getInt("C2"));
                blackhole.consume(rs.getInt("C3"));
                blackhole.consume(rs.getInt("C4"));
            }
        });
    }
}

Regarding "semantics of the name". This used to be not very clear in the JDBC specification either, but was made more explicit that `getXXX(String)` should retrieve by column label, not column name. Unfortunately, backwards compatibility in drivers make the wrong behaviour still stick around for some. — Mark Rotteveel, May 05 '22 at 14:30
As to "ambiguities", JDBC requires the first matching column to be returned, always. So although inconvenient, it should not be ambiguous. — Mark Rotteveel, May 05 '22 at 14:31
@MarkRotteveel: Thanks for the clarifications. While the spec (and possibly the implementations) aren't ambiguous with respect to behaviour, the ambiguity stems from intent. Imagine changing the JOIN order of a `SELECT *` query, or replacing `ON` by `USING`. Suddenly, a query that worked well before changes behaviour, because the order of appearance of `ID` columns now changes. — Lukas Eder, May 05 '22 at 14:37