How to map SQL collation setting to a Java comparator?

Question

Is there a way to translate a database's collation setting (e.g. SQL_Latin1_General_CP1_CI_AS) to a Java Comparator implementation so I can apply the same ordering as the database does, using in Java code?

Is there an existing library that already provides this mapping?

I haven't seen such that library but you can query the **master..spt_values** (using _collate SQL_Latin1_General_CP1_CI_AS_ for _char(number)_) to see all the values and their sort orders now you can write your customized Comparator — void, Aug 20 '15 at 01:33
As I understand, the question is only about MS SQL Server, not about other database providers. — iuzuz, Nov 23 '18 at 10:56

score 2 · Answer 1 · edited Apr 13 '17 at 12:42

Simplistically, you can use the COLLATIONPROPERTY function, which gives you:

CodePage
LCID
ComparisonStyle
Version

The ComparisonStyle is a bit-masked field that is encoded as follows:

Case insensitivity (IgnoreCase) = 1
Accent insensitivity (IgnoreNonSpace) = 2,
Kana type insensitivity (IgnoreKanaType) = 65536
Width insensitivity (IgnoreWidth) = 131072

Unfortunately, everything-sensitive (e.g. Latin1_General_CS_AS_KS_WS ) equates to 0. This is unfortunate because both _BIN and _BIN2 collations also equate to 0. Hence you still need to check the name to see if it ends in _BIN% to get the full picture.

But, this is not so simple. There are two main types of Collations: SQL Server collations and Windows Collations.

The SQL Server collations (i.e. starting with SQL_ ) are deprecated and should not be used anymore, though a lot of systems to default to SQL_Latin1_General_CP1_CI_AS.

For both types of collations, NCHAR / NVARCHAR / XML data uses the Unicode sorting algorithms. For non-Unicode data, the Windows collations should sort the same between SQL Server and .NET. However, for the SQL Server collations, the sorting algorithm does not necessarily match to the Windows collation (or possibly anything). But they do have their own Sort Order IDs and there might be public documentation describing those rules.

The Windows collations have several variations:

differing versions: unspecified should be the original set, then the first set of updates are labeled _90 and the newest updates are the _100 series.
differing binary ordering: the older _BIN collations do not map to anything exactly in .NET since they compared the first character as a character. The newer _BIN2 collations are pure code-point comparisons and ordering and should map to the ordinal ComparisonStyle.

Beyond the specifics of any particular collation, there is another factor complicating what you are trying to accomplish: the default collation for a database does not necessarily determine the collation used for sorting / comparing a particular predicate or field! The collation can be taken from the field being operated on, it can be taken from the database default for string literals and variables, or it can be overridden in both cases via the COLLATE clause. Please see the MSDN page for Collation Precedence for more details.

In the end, there is no deterministic means of getting the collation(s) used because each predicate in a WHERE clause could potentially use a different collation, and that can be different from the collation used in the ORDER BY, and JOIN conditions (and GROUP BY, etc) can have their collations.

But to simplify a little:

If the data is non-Unicode, check the Code Page for the specified locale / LCID. Then use that to create the same Encoding in .NET.
If the data is Unicode and not using a _BIN collation then it should match the same settings in .NET. Again, the _BIN2 collation should match the ordinal ComparisonStyle.
If the data is non-Unicode with a SQL Server collation or Windows _BIN collation, then cross your fingers, rub a lucky rabbit's foot (though not so lucky for the rabbit), etc.

But wait, there's more! Seriously.

You need to also consider:

as with any standard, it is up to the implementer to follow the spec. That doesn't always happen. So even with what should be a truly equivalent collation between SQL Server and your Java app, and even if there are no issues with Collation Precedence, there can still be differences in sorting and comparisons. For an example, check out my "update" on this answer on DBA.StackExchange: Why does MS SQL Server return a result for empty string check when Unicode string is not empty
If you are transferring data between .NET and Java, keep in mind that Java is UTF-16 Big Endian while .NET is UTF-16 Little Endian.

Gili · Accepted Answer · 2015-08-25T16:50:39.193

I ended up doing the following:

Query the current database's collation setting.
Next, parse the description of the collator into sub-components such as "case-insensitive" or "accent-sensitive".
Next, construct a Comparator corresponding to these rules

Enjoy!

/**
 * Returns the Comparator associated with the database's default collation.
 * <p>
 * Beware! <a href="http://stackoverflow.com/a/361059/14731">Some databases</a> sort unicode strings differently than
 * non-unicode strings, even for the same collation setting.
 * <p>
 * @param unicode true if the String being sorted is unicode, false otherwise
 * @return the Comparator associated with the database's default collation
 * @throws DatabaseException if an unexpected database error occurs
 */
public Comparator<String> getComparator(boolean unicode)
    throws DatabaseException
{
    // @see http://stackoverflow.com/a/5072926/14731, http://stackoverflow.com/a/27052010/14731 and
    // http://stackoverflow.com/q/32209137/14731
    try (Connection connection = server.getDatasource().getConnection())
    {
        try (PreparedStatement statement = connection.prepareStatement(
            "SELECT description from sys.fn_HelpCollations()\n" +
            "WHERE name = SERVERPROPERTY('collation')"))
        {
            try (ResultSet rs = statement.executeQuery())
            {
                if (!rs.next())
                    throw new ObjectNotFoundException(this);
                String description = rs.getString(1);
                List<String> tokens = Arrays.asList(description.split(",\\s*"));
                // Description format: language,property1,property2,...,propertyN,sorting,...
                ComparatorBuilder comparatorBuilder = new ComparatorBuilder();

                // Skip the language
                tokens = tokens.subList(1, tokens.size());
                // See https://technet.microsoft.com/en-US/library/ms143515(v=SQL.90).aspx for a list of possible tokens
                for (String token: tokens)
                {
                    if (token.toLowerCase().contains("sort"))
                    {
                        // Stop as soon as we hit information related to the sorting order
                        break;
                    }
                    switch (token)
                    {
                        case "case-insensitive":
                        {
                            comparatorBuilder.caseInsensitive(true);
                            break;
                        }
                        case "accent-insensitive":
                        {
                            comparatorBuilder.accentInsensitive(true);
                            break;
                        }
                        case "kanatype-insensitive":
                        {
                            comparatorBuilder.kanaInsensitive(true);
                            break;
                        }
                        case "width-insensitive":
                        case "width-insensitive for Unicode Data":
                        {
                            comparatorBuilder.widthInsensitive(true);
                            break;
                        }
                        case "case-sensitive":
                        case "accent-sensitive":
                        case "kanatype-sensitive":
                        case "width-sensitive":
                        {
                            // Do nothing, this is the default setting.
                            break;
                        }
                        default:
                            throw new AssertionError(String.format("Unexpected token: '%s'. Description: '%s'", token, description));
                    }
                }
                assert (!rs.next()): "Database returned more rows than expected";
                if (unicode)
                    comparatorBuilder.discardHyphens(true);
                return comparatorBuilder.build();
            }
        }
    }
    catch (SQLException e)
    {
        throw new DatabaseException(e);
    }
}

import com.ibm.icu.text.Transliterator;
import java.text.Normalizer;
import java.util.Comparator;

/**
 * Converts a database collation to a Java comparator.
 * <p>
 * @see https://msdn.microsoft.com/en-us/library/hh230914.aspx?f=255&MSPPError=-2147217396
 * @see http://zarez.net/?p=1893
 * @author Gili Tzabari
 */
class ComparatorBuilder
{
    // SQL Server: https://technet.microsoft.com/en-US/library/ms143515(v=SQL.90).aspx
    private boolean caseInsensitive = false;
    private boolean accentInsensitive = false;
    private boolean kanaInsensitive = false;
    private boolean widthInsensitive = false;
    /**
     * Indicates if hyphens should be discarded prior to sorting (default = false).
     */
    private boolean discardHyphens = false;

    /**
     * @return true if the comparator ignores the difference between uppercase and lowercase letters (default = false)
     */
    public boolean caseInsensitive()
    {
        return caseInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between uppercase and lowercase letters
     * @return this
     */
    public ComparatorBuilder caseInsensitive(boolean value)
    {
        this.caseInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between accented and unaccented characters (default = false)
     */
    public boolean accentInsensitive()
    {
        return accentInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between accented and unaccented characters
     * @return this
     */
    public ComparatorBuilder accentInsensitive(boolean value)
    {
        this.accentInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between the two types of Japanese kana characters: Hiragana
     *         and Katakana (default = false)
     */
    public boolean kanaInsensitive()
    {
        return kanaInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between the two types of Japanese kana characters:
     *              Hiragana and Katakana
     * @return this
     */
    public ComparatorBuilder kanaInsensitive(boolean value)
    {
        this.kanaInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator ignores the difference between a single-byte character and the same character when
     *         represented as a double-byte character (default = false)
     */
    public boolean widthInsensitive()
    {
        return widthInsensitive;
    }

    /**
     * @param value true if the comparator ignores the difference between a single-byte character and the same character
     *              when represented as a double-byte character
     * @return this
     */
    public ComparatorBuilder widthInsensitive(boolean value)
    {
        this.widthInsensitive = value;
        return this;
    }

    /**
     * @return true if the comparator discards hyphens prior to sorting (default = false)
     */
    public boolean discardHyphens()
    {
        return discardHyphens;
    }

    /**
     * @param value true if comparator discards hyphens prior to sorting
     * @return this
     */
    public ComparatorBuilder discardHyphens(boolean value)
    {
        this.discardHyphens = value;
        return this;
    }

    /**
     * @return a Comparator instance
     */
    public Comparator<String> build()
    {
        return (java.lang.String first, java.lang.String second) ->
        {
            String firstNormalized = first;
            String secondNormalized = second;
            if (discardHyphens)
            {
                firstNormalized = firstNormalized.replaceAll("-", "");
                secondNormalized = secondNormalized.replaceAll("-", "");
            }
            if (accentInsensitive)
            {
                // @see http://stackoverflow.com/a/3322174/14731
                firstNormalized = Normalizer.normalize(first, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
                secondNormalized = Normalizer.normalize(second, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
            }
            if (kanaInsensitive)
            {
                // @see http://stackoverflow.com/a/6577778/14731
                Transliterator transliterator = Transliterator.getInstance("Hiragana-Katakana");
                firstNormalized = transliterator.transliterate(firstNormalized);
                secondNormalized = transliterator.transliterate(secondNormalized);
            }
            if (widthInsensitive)
            {
                Transliterator transliterator = Transliterator.getInstance("Halfwidth-Fullwidth");
                firstNormalized = transliterator.transliterate(firstNormalized);
                secondNormalized = transliterator.transliterate(secondNormalized);
            }
            // Case-normalization is not as easy as it seems. See
            // http://mattryall.net/blog/2009/02/the-infamous-turkish-locale-bug and the implementation of
            // String.compareToIgnoreCase(). Better to delegate to a trusted implementation.
            if (caseInsensitive)
                return firstNormalized.compareToIgnoreCase(secondNormalized);
            else
                return firstNormalized.compareTo(secondNormalized);
        };
    }
}

The String.compareTo() compares only using the normal < operator for characters. It doesn't consider the language specific order of characters in the database collation. I think, you need to use the java Collator for the right language determined from the database collation name. — iuzuz, Nov 23 '18 at 11:04
@iuzuz can you provide a concrete example where the character order does not correspond to the order used by the database? — Gili, Nov 24 '18 at 11:59
Firstly, I forgot to say, the approach is good. I would do is a similar way :) The requested example would be the German ä. I did not tested it, but, I'm quite sure, it will be ordered differently. Another hint: in the code, you did not consider the first part of the split result: the language. And depending of it the collation is different. — iuzuz, Jan 04 '19 at 12:24
@iuzuz I'm no longer actively working on this project anymore so it's hard for me to validate the case you mentioned or continue to improve the code. That said, I will accept edits and other answers that improve on this initial implementation. Thank you. — Gili, Jan 04 '19 at 19:14

How to map SQL collation setting to a Java comparator?

2 Answers2