
The German language sometimes uses words of absurd length. As long as they are constant at runtime, this is not an issue: careful designers will just apply word wrap where syntactical rules allow. Now we have an application where labels are user-configurable at runtime. I'm aware of 10 good reasons why this is a bad idea; possible solutions need not address it, as it is a business requirement.

We wrote a translation tool based on a SQL Server database where both the UX team and the translators can look up strings by partial token matching, implemented simply as

SELECT Literal FROM Translations WHERE LCID=N'de' AND Literal LIKE N'%'+@token+N'%'

I know that this is not SARGable; at the moment it performs reasonably well, so a possible solution does not need to address this either.

Translators have come up with the habit of including zero-width spaces "​" as hyphenation points in texts like this:

Entgelt​fortzahlungs​anspruch

You can actually copy and paste the spaces from that string; there are two, one after "Entgelt" and one before "anspruch".

At runtime, this works like a charm. At design time, it has become impossible to copy and paste "Entgeltfortzahlungsanspruch" (without the invisible spaces!) into the search box of the tool and get results, because the LIKE operator does not ignore these spaces. This has become a bit of a problem: team members keep adding the same string, thinking it has not been included in the glossary yet.
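For illustration, a minimal repro of that behaviour, with the zero-width spaces written out as NCHAR(8203) (U+200B ZERO WIDTH SPACE) so they are visible in code:

DECLARE @zws nchar(1) = NCHAR(8203);   -- U+200B ZERO WIDTH SPACE
DECLARE @stored nvarchar(100) = N'Entgelt' + @zws + N'fortzahlungs' + @zws + N'anspruch';
DECLARE @token nvarchar(100) = N'Entgeltfortzahlungsanspruch';   -- pasted without the invisible spaces

SELECT CASE WHEN @stored LIKE N'%' + @token + N'%'
            THEN N'match' ELSE N'no match' END;   -- 'no match': LIKE treats the zero-width spaces as ordinary characters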

Is there some LIKE syntax jugglery, regex magic, or other ingenious algorithm that would allow searches to ignore zero-width spaces while still preserving them in the data?

Previous research: of the four questions that come up when searching for "tsql" and "zero-width space", this one is partly relevant; we have actually implemented part of it, because the application can already deal with accent sensitivity/insensitivity. But I could not find any information hinting that collations or TRIMming would help with this kind of inner whitespace.

Cee McSharpface
  • BTW, why not use 'SOFT HYPHEN' (U+00AD) instead of 'ZERO WIDTH SPACE' (U+200B)? (Not that it has anything to do with answering your question, I'm just wondering.) – Amadan Jul 30 '18 at 10:49
  • One approach is to add a (persisted) computed column that contains the data without zero-width spaces and search on that. (The search string should also have zero-width spaces removed.) FWIW, [this](https://stackoverflow.com/questions/35245812/whats-a-good-way-to-trim-all-whitespace-characters-from-a-string-in-t-sql-witho/35247507#35247507) answer demonstrates a way of trimming .NET whitespace characters from a string in TSQL. – HABO Jul 30 '18 at 13:37

1 Answer


Since WHERE col LIKE '%foo%' can't use an index anyway, there's no big downside in WHERE REPLACE(col, '​', '') LIKE '%foo%' (which compares the string in the database but without ZWS with a user-input foo that presumably also won't have ZWS).

(If it's not obvious, the first "empty" string there is not actually empty :D )
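Spelled out with NCHAR(8203) so the zero-width space stays visible in the code, the query from the question might look like this:

SELECT Literal
FROM Translations
WHERE LCID = N'de'
  AND REPLACE(Literal, NCHAR(8203), N'') LIKE N'%' + @token + N'%'  -- NCHAR(8203) = U+200B ZERO WIDTH SPACE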

It does impose a little bit of a performance hit as all values in col need to be processed at query time. If you can spare the space and the engineering to do so, you can shave off this processing time by setting up another column, which won't have any ZWS in the first place (i.e. have one column with 'Entgelt​fortzahlungs​anspruch' and one column with 'Entgeltfortzahlungsanspruch'). You can do this by explicitly adding both in your client code, or using a TRIGGER to fill the other one by removing ZWS at INSERT or UPDATE time.
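As an alternative to client code or a trigger, a persisted computed column (as suggested in the comments above) keeps the stripped copy in sync automatically. A sketch, where the column name LiteralSearch is made up:

ALTER TABLE Translations
    ADD LiteralSearch AS REPLACE(Literal, NCHAR(8203), N'') PERSISTED;

-- searches then hit the pre-stripped column
SELECT Literal FROM Translations
WHERE LCID = N'de' AND LiteralSearch LIKE N'%' + @token + N'%';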

Amadan
  • brilliant. My final expression is `REPLACE(sLiteral, nchar(173), N'') LIKE @token --U+00AD soft hyphen` so future generations are less tempted to optimize it away. – Cee McSharpface Jul 30 '18 at 23:14