How should one reasonably handle combining characters in UTF-8

Question

I am writing a website with a user chat function. At some point a user decided to use diacritics to draw all over everyone's screens.

In response I removed all text that was not in the ASCII character range. I'd like to re-enable UTF-8, but I don't know what to do about the combining marks (Unicode characters that modify the base character they follow). As you can see from the example below, Stack Overflow doesn't handle this problem either.

Malicious input t̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀̀è̀̀̀̀̀̀̀x̀̀̀̀̀̀̀̀̀̀t̀̀̀̀̀̀̀̀̀̀̀̀̀

I feel like only 1 combining mark per character should be allowed, but that seems like an excessive thing for me to have to write myself, and I don't know whether any languages need 2 or 3 combining characters on one base character. I imagine Korean uses them extensively.

This seems like it should be a solved problem, but I can't find any useful information on the topic.

While this might not actually solve the problem with multiple combining characters, to solve your "overdraw" issue you can add `overflow: hidden;` to the element containing the user input to avoid the "screen drawing". You could also use the [Transliterator](https://stackoverflow.com/a/35178027/6237870) to strip all diacritics, but I didn't find a way to keep "á", for example. – Kryptur, May 21 '19 at 14:41
You can limit the number of combining characters. Korean does not rely on combining characters; it has its own special rules in Unicode (but you should normalize Unicode strings in any case). I think up to 3 combining characters can occur in standard languages (OTOH, I do not think people like to type them). More than that is only for extra marks (maths, or in some old texts to mark singing, pauses, intonation, ...). – Giacomo Catenazzi, May 21 '19 at 14:59
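
The suggestion in the comments (normalize, then cap the number of combining characters per base character) could be sketched roughly as follows. This is only an illustration, not a vetted sanitizer: the function name `limit_combining_marks` and the default cap of 3 are assumptions taken from the estimate above, and "combining character" is approximated here as any code point whose Unicode general category starts with `M` (Mn, Mc, Me). Some scripts or notations may legitimately need a higher cap.

```python
import unicodedata


def limit_combining_marks(text: str, max_marks: int = 3) -> str:
    """Return text in NFC with at most max_marks combining marks per base character.

    Hypothetical helper: the cap of 3 is an assumption, not a Unicode rule.
    """
    # NFC first, so that pairs with a precomposed form (e.g. "a" + U+0301 -> "á")
    # collapse into one code point and do not count against the cap.
    normalized = unicodedata.normalize("NFC", text)

    kept = []
    run = 0  # number of consecutive combining marks seen so far
    for ch in normalized:
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # drop marks beyond the cap
        else:
            run = 0  # any non-mark character starts a new run
        kept.append(ch)
    return "".join(kept)


if __name__ == "__main__":
    noisy = "t" + "\u0300" * 20 + "e" + "\u0300" * 5 + "xt"
    cleaned = limit_combining_marks(noisy)
    print(len(noisy), len(cleaned))
```

Because of the NFC step, ordinary accented letters such as "á" survive unchanged, and Korean text written as decomposed jamo composes into precomposed syllables, so neither is affected by the cap. Combining this with `overflow: hidden;` on the message element would still be a reasonable second line of defence.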


0 Answers