What's up with these Unicode combining characters and how can we filter them?

Question

กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้

These recently showed up in Facebook comment sections.

How can we sanitize this?

@AshwiniChaudhary I have done this and what should be the expected output ? It didn't change much... — mas-designs, May 02 '12 at 13:40
Why the closing votes? It's a programming-related question, as I want to know how to sanitize this type of input so the comment sections on my website will not be the 13 years old's playground... — XCS, May 02 '12 at 13:51
กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิ "so the comment sections on my website will not be the 13 years old's playground." Actually, without sanitization, anyone posting these characters can make the comment above it unreadable, which is not at all a pleasant user experience. — XCS, May 02 '12 at 22:21
Shouldn't we actually consider it a browser bug? In my opinion, the browser should enlarge the containing box so that all text _including the accents_ fits in and doesn't overflow over/under other boxes — voidengine, May 03 '12 at 11:07
@pjotr It's definitely not a browser bug. If you want the characters not to overflow the containing box you can simply solve that with CSS (overflow:hidden;)... — XCS, May 03 '12 at 11:29
Another post about this particular display issue (just related, not a duplicate): [What's the character encoding used?](http://stackoverflow.com/questions/9310177/whats-the-character-encoding-used) — Pops, May 04 '12 at 20:29
Based on this answer: http://stackoverflow.com/questions/7119115/why-do-those-thai-characters-display-on-the-web-page-with-a-long-tail It DOES look like it may be a browser problem, or even OS. There is a problem with Thai Unicode. — FlipMcF, Mar 14 '13 at 23:05
Related: [How does Zalgo text work?](http://stackoverflow.com/questions/6579844/how-does-zalgo-text-work) — nwellnhof, Jan 28 '14 at 01:43
As a note, it seems that stackoverflow fixed this issue with large unicode characters overlapping other text. — XCS, Oct 14 '17 at 10:44

T.J. Crowder · Accepted Answer · 2012-11-11T08:19:28.050

What's up with these unicode characters?

That's a character with a series of combining characters. Because the combining characters in question want to go above the base character, they stack up (literally). For instance, in the case of

ก้้้้้้้้้้้้้้้้้้้้

...it's an ก (Thai character ko kai) (U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).

How can we sanitize this?

You could pre-process the text and limit the number of combining characters that can be applied to a single base character, but the effort may not be worth the reward. You'd need the Unicode character data for all current characters so you'd know which ones are combining, and you'd need to allow at least a few per base because some languages are written with several diacritics on a single base. If you're willing to limit comments to the Latin character set, that's an easier range check, but of course it's only an option if you only want to support a few languages. More information, code charts, etc. are at unicode.org.
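
As a rough illustration of the "limit the number of combining characters" idea, here is a minimal sketch. It assumes a JavaScript engine that supports Unicode property escapes (`\p{M}` with the `u` flag, ES2018 or later), which takes the place of hand-maintained tables. The function name `limitCombiningMarks` and the default cap of 3 are hypothetical choices for illustration, not recommendations.

```javascript
// Hypothetical helper: keep at most `maxMarks` combining marks in any run.
// \p{M} is the Unicode general category "Mark" (Mn, Mc and Me together).
function limitCombiningMarks(text, maxMarks = 3) {
  // Capture the first `maxMarks` marks of a run and drop whatever follows them.
  const excess = new RegExp(`(\\p{M}{${maxMarks}})\\p{M}+`, "gu");
  return text.replace(excess, "$1");
}

const stacked = "\u0E01" + "\u0E49".repeat(20); // ko kai followed by 20 mai tho
const capped = limitCombiningMarks(stacked, 3);
console.log([...capped].length); // number of code points left after capping
```

The cap is a policy decision: it needs to be high enough for languages that legitimately stack several marks on one base.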

BTW, if you ever want to know how some text was composed: for another question just recently I coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You copy and paste the text into the text area, and it shows you all of the code points (~characters) that the text is made up of, with links such as those above to the page describing each character. It only works for code points U+FFFF and under, because it's written in JavaScript and handling code points above U+FFFF takes more work than I wanted to do for that question. (In JavaScript, a "character" is always a 16-bit code unit, so a code point above U+FFFF is split across two separate JavaScript "characters", and I didn't account for that.) Still, it's handy for most texts...
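
For reference, a small sketch of the same "show me the code points" idea that also copes with code points above U+FFFF. It is not the code behind the JSBin page; `listCodePoints` is a hypothetical name, and the sketch assumes an ES2018+ engine for `\p{M}`.

```javascript
// Hypothetical helper: describe each code point of a string.
// Iterating a string with for...of walks code points, not UTF-16 code units,
// so a character above U+FFFF arrives whole instead of as two surrogates.
function listCodePoints(text) {
  const result = [];
  for (const ch of text) {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
    result.push({
      codePoint: `U+${hex}`,
      isCombiningMark: /^\p{M}$/u.test(ch), // general category Mark
    });
  }
  return result;
}

// ko kai, mai tho, and one code point outside the BMP
console.log(listCodePoints("\u0E01\u0E49\u{1F600}"));
```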

Wouldn't you just delete repeated copies of the same combining codepoint back to back into a single copy? When would you ever need to combine the same codepoint onto a base codepoint more than once? — Remy Lebeau, May 02 '12 at 20:43
@RemyLebeau: *"When would you ever need to combine the same codepoint onto a base codepoint more than once?"* I don't know, I know very, very little about how you write other languages -- Thai, for instance. I wouldn't be at all surprised to find out that more than one of the same code point was valid in some. But doing that doesn't reduce the complexity; you still need one of the Unicode tables for figuring out which ones are combining characters. — T.J. Crowder, May 03 '12 at 08:07
I made your page accept the unicode string from the url e.g. http://jsbin.com/erajer/7/?%E0%B8%81%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89%E0%B9%89 — ubershmekel, Mar 12 '13 at 16:04
JavaScript library to easily remove Unicode combining marks from strings: http://mths.be/stripcombiningmarks — Mathias Bynens, Jan 08 '14 at 08:55
JavaScript uses UTF-16 with « [surrogate pairs](https://en.wikipedia.org/wiki/UTF-16#U.2B10000_to_U.2B10FFFF) » — dolmen, Jul 26 '16 at 14:09
@dolmen: UTF-16 always has the possibility of surrogate pairs. What you mean is that [JavaScript tolerates invalid sequences](http://www.ecma-international.org/ecma-262/7.0/index.html#sec-ecmascript-language-types-string-type), where (of course) UTF-16 does not. — T.J. Crowder, Jul 26 '16 at 14:40
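
The suggestion in the comments above to collapse back-to-back copies of the same combining code point can be sketched as follows, again assuming ES2018 property escapes. `collapseRepeatedMarks` is a hypothetical name. The second sample shows the limitation: alternating two different marks produces no adjacent duplicates, so the stack survives.

```javascript
// Hypothetical helper: collapse back-to-back copies of the same mark into one.
// \1 refers back to the exact mark captured by (\p{M}).
function collapseRepeatedMarks(text) {
  return text.replace(/(\p{M})\1+/gu, "$1");
}

const repeated = "\u0E01" + "\u0E49".repeat(20);      // one mark, 20 times
const mixed = "\u0E01" + "\u0E49\u0E47".repeat(10);   // two marks, alternating

console.log([...collapseRepeatedMarks(repeated)].length); // base plus one mark
console.log([...collapseRepeatedMarks(mixed)].length);    // unchanged length
```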

nwellnhof · Answer 2 · 2014-03-09T02:19:46.993

If you have a regex engine with decent Unicode support, it's trivial to sanitize strings of this kind. In Perl, for example, you can remove all but the first combining mark from every (user-perceived) character like this:

#!/usr/bin/perl
use strict;
use utf8;

binmode(STDOUT, ':utf8');

my $string = "กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้ กิิ ก้้ ก็็ ก็็ กิิ ก้้ ก็็ กิิ ก้้";
$string =~ s/(\p{Mark})\p{Mark}+/$1/g; # Strip excess combining marks
print("$string\n");

This will print:

กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้ กิ ก้ ก็ ก็ กิ ก้ ก็ กิ ก้

answered Mar 12 '13 at 18:33

I can't read Tibetan, but I'm concerned that this brute force approach may remove functionality from the way the language is designed. I've seen unicode that has legitimate use-cases of more than one combining mark. Arabic is a good example. I'll try to remember to run this by my Tibetan co-workers. – FlipMcF Mar 12 '13 at 19:18
You're right, there are certainly cases where multiple combining marks are legitimate. But you can easily change the regex to allow a certain maximum of marks. – nwellnhof Mar 12 '13 at 19:45
Upvoted because it does answer the 'how do you sanitize this' question. But I think this would be a maintenance nightmare. – FlipMcF Mar 15 '13 at 00:08
Also, the RE just removes _adjacent_ duplication. It would not clean up, say: `...`. So, if your text needs multiple _different_ combining characters, it will pass through fine; and malicious text could still be built. – Jesse Chisholm Jul 10 '18 at 15:47

score 14 · Answer 3 · edited May 23 '17 at 12:18

"How can we sanitize this" is best answered above by T.J. Crowder.

However, I think sanitization is the wrong approach, and Cristy has it right with overflow:hidden on the containing element in CSS.

At least, that's how I'm solving it.

answered Mar 12 '13 at 18:00 · FlipMcF

Matas Vaitkevicius · Answer 4 · 2016-03-21T13:18:49.447

OK, this one took me a while to figure out. I was under the impression that the combining characters used to produce zalgo were limited to the ranges listed on Wikipedia, so I expected the following regex to catch the freaks.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,})

and it didn't work...

The catch is that the list on the wiki does not cover the full range of combining characters.

What gave me a hint is that "ก้้้้้้้้้้้้้้้้้้้้".charCodeAt(2).toString(16) returns "e49", which is not within any of those combining ranges; U+0E49 sits in the Thai block (U+0E00 to U+0E7F).
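
A quick way to confirm this, assuming an engine with ES2018 Unicode property escapes: test U+0E49 against the five combining blocks from the regex above and against the Nonspacing_Mark category. The variable names are made up for this sketch.

```javascript
// The five "combining" blocks from the first regex, written with ASCII hyphens.
const wikiRanges = /[\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]/;
const maiTho = "\u0E49"; // the mark that stacks in the question's sample

const inWikiRanges = wikiRanges.test(maiTho);      // inside those blocks?
const isNonspacingMark = /\p{Mn}/u.test(maiTho);   // general category Mn?

console.log(maiTho.codePointAt(0).toString(16), inWikiRanges, isNonspacingMark);
```

So the mark is a combining character by category even though it lives outside the dedicated combining blocks.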

In C# they fall under UnicodeCategory.NonSpacingMark, and the following script flushes them out:

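    // Needs: using System; using System.Globalization; using System.IO;
    //        using System.Linq; using System.Text;
    // [Test] is assumed to be NUnit's attribute; place the method in a test fixture class.
    [Test]
    public void IsZalgo()
    {
        // Categories to list; nonspacing marks stack on the preceding base character.
        var zalgo = new[] { UnicodeCategory.NonSpacingMark };

        var html = new StringBuilder("<table>");
        // Walk the Basic Multilingual Plane; a C# char is a single UTF-16 code unit.
        for (var i = 0; i <= 0xFFFF; i++)
        {
            var c = (char)i;
            var category = Char.GetUnicodeCategory(c);
            if (zalgo.Contains(category))
            {
                // Columns: hex code, the character, its category, "A" with the mark applied three times.
                html.AppendFormat("<tr><td>{0}</td><td>{1}</td><td>{2}</td><td>A&#{3};&#{3};&#{3};</td></tr>\n", i.ToString("X"), c, category, i);
            }
        }
        html.Append("</table>");
        // Overwrites any previous output in one write.
        File.WriteAllText("IsModifyLike.html", html.ToString());
    }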

By looking at the generated table you should be able to see which ones stack. One range that is missing on the wiki is 06D6-06DC, another is 0730-0749.
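
The same survey can be sketched in JavaScript, assuming ES2018 property escapes. Instead of an HTML table it collects contiguous runs of Nonspacing_Mark code points in the BMP. `nonspacingMarkRanges` is a hypothetical name, and the exact ranges depend on the Unicode version the engine ships with.

```javascript
// Hypothetical helper: contiguous runs of nonspacing marks (Mn) in U+0000..U+FFFF.
function nonspacingMarkRanges() {
  const isMn = /^\p{Mn}$/u;
  const found = [];
  let start = null;
  for (let cp = 0; cp <= 0xffff; cp++) {
    const hit = isMn.test(String.fromCodePoint(cp));
    if (hit && start === null) start = cp;   // a run begins
    if (!hit && start !== null) {            // a run just ended
      found.push([start, cp - 1]);
      start = null;
    }
  }
  if (start !== null) found.push([start, 0xffff]);
  return found;
}

const bmpRanges = nonspacingMarkRanges();
const hex = (n) => n.toString(16).toUpperCase().padStart(4, "0");
// Print each run as START-END, ready to compare against a hand-written regex.
console.log(bmpRanges.map(([a, b]) => `${hex(a)}-${hex(b)}`).join(" "));
```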

UPDATE:

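Here's an updated regex that should fish out all the zalgo, including the ones that slipped past the 'normal' ranges.

([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F\u0483-\u0486\u05C7\u0610-\u061A\u0656-\u065F\u0670\u06D6-\u06ED\u0711\u0730-\u073F\u0743-\u074A\u0F18-\u0F19\u0F35\u0F37\u0F72-\u0F73\u0F7A-\u0F81\u0F84\u0e00-\u0eff\uFC5E-\uFC62]{2,})

Note that \u0e00-\u0eff spans the whole Thai and Lao blocks, base letters included, so this pattern also matches any two adjacent Thai or Lao letters; narrow that range to the marks themselves if you need to accept text in those scripts.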

The hardest bit is identifying them; once you have done that, there's a multitude of solutions, including some good ones above.

Hope this saves you some time.

I appreciate your answer, but this is a lost answered question. So why to add new answers unnecessarily? It is just my view. Moreover, your answer is not JavaScript, right? — Praveen Kumar Purushothaman, Mar 17 '16 at 12:42
@PraveenKumar It uncovers why normal zalgo validation `([\u0300-\u036F\u1AB0-\u1AFF\u1DC0-\u1DFF\u20D0-\u20FF\uFE20-\uFE2F]{2,})` does not work. Don't you find it interesting that stacking unicode is not limited to what's on wiki? What do you mean by 'lost answered question'? **EDIT**: You might find it odd to add an answer to a 3 year old question, but since it took me a while to figure out why this type of zalgo worked I couldn't let such knowledge go to waste. Next guy will save some time. — Matas Vaitkevicius, Mar 17 '16 at 12:45
@PraveenKumar the question does not state a language, and posting a new answer on an old question is completely appropriate if the old answers were deficient in some way. Unfortunately I do not have enough experience with this problem, or it would get an upvote from me. — Mark Ransom, Mar 21 '16 at 13:25
This RE has the benefit of catching mixed combining characters, with the drawback of never allowing a base that properly does need more than one combining character. — Jesse Chisholm, Jul 10 '18 at 15:54
