22

I've read about how Zalgo text works, and I'm looking to learn how a chat or forum software could prevent that kind of annoyance. More precisely, what is the complete set of Unicode combining characters that needs to:

a) either be stripped, assuming chat participants are to use only languages that don't require combining marks (i.e. you could write "fiancé" with a combining mark, but you'd be a bit Zalgo'ed yourself if you insisted on doing so); or,

b) be reduced to a maximum of 8 consecutive characters (the maximum encountered in actual languages)?

EDIT: In the meantime I found a completely differently phrased question ("How to protect against... diacritics?"), which is essentially the same as this one. I made its title more explicit so others will find it as well.

Robert Columbia
Dan Dascalescu
  • 6
    Why should chat or forum software prevent vertical rubbish automatically, when it cannot do the same with horizontal rubbish? – Walter Tross Mar 09 '14 at 01:14
  • 28
    Y̒̌͛́̓̀͊ͫ͌ͦo͊͂ͤ̊̒̆͊ͪ̋ͯͥ͌ͧ͑̑̂͐͗̏u̇̽̿͋̋́̅̐̄ͮ̿͆̚r͊ͥͣ͂̑ͩ̒̑̋̊̅ ͬ̔̍̾̓ͩ̇͒ͯ͗͐͐ͧ̍͊̚c͋̈́̂̽ͬ͒͊ͣͤ͊̋͛̿͒̚̚oͩͫ͛̂̄̐̽̑ͬ͑̍̃ͯm̉̈́̾ͨ̆̊ͨͪ͌mͫ̾͋ͨͤ̈́͑́͐́eͮ͐̍̌ͬ͛̃̃̿ͪ̌͂n̊͋ͫ͆t̊͊ͪ̌́͆ ̎̔̉ͮ̋̋͐̐ͮ͛̈̆̉̈́ͣ̎̐̏̚i͆̌͆̃̾̽ͥ̎͊́s̑̌̓̆͊́ͦ͆̍̇̌̀̈̓̈́ͪ̚ ̍̀͌ͩͮ́̿́̓̈́̍ͣ̔ṁ̋̑̉ͤi̒̌̿̔ͣ̇͐ͭͫͬ̎͊ͬ͊̓s͗̽ͦ̄͋ͤ͆͊ͬ̈́̂̌ͦ͒̈́̓ͪ̏gͣ͆̃͛ͨͩ̚u͆͆̄ͬ̍ͯiͬͩ̎̑d̍ͩ̐ͫ̍e͗ͪ̀ͥͨ̀͌̒ͦͩͣ̎ͯ͂̔ͤd̆ͭ͆.̑̃͂̆̀̈́̽ͭ̂ͮ̓ If that was not demo enough, here's why: 1) a crap regular comment affects only itself, while a Zalgo one affects others. 2) Because it *is* possible to automatically filter out Zalgo, while automatically filtering out low-quality comments requires developing general AI. – Dan Dascalescu Mar 09 '14 at 01:38
  • possible duplicate of [What's up with these Unicode characters?](http://stackoverflow.com/questions/10414864/whats-up-with-these-unicode-characters) – nwellnhof Mar 09 '14 at 01:54
  • The question does not describe a programming problem. Rather, it asks “what should I set as the goal in programming when the purpose is to avoid annoying me/others with Zalgo?” Besides, it’s a fairly broad question. Which of the world’s 6,000 languages do you intend to consider, and do you think it’s OK to filter out characters in English text written in Normalization Form D? (E.g., “fiancé” *can* validly be written using a combining mark, though it usually isn’t.) – Jukka K. Korpela Mar 09 '14 at 06:24
  • @JukkaK.Korpela: Yes, it's a programming question, because one of the answers is "Use the [strip-combining-marks](https://github.com/mathiasbynens/strip-combining-marks) library". Yes, I think it's OK to filter out these characters from English text - nobody writes "fiancé" using them. PS: I'm the one who [fought](http://stackoverflow.com/posts/6580026/revisions#spacer-2aa1874d-ed9b-4884-80ad-59722c8d4d26) to have your answer to "How does Zalgo text work" recognized as the correct one, so you're welcome. – Dan Dascalescu Mar 09 '14 at 07:46
  • [Edited](http://stackoverflow.com/revisions/10414864/4) the title of "What's up with these Unicode characters" to something search queries would actually find, and voted to close my own question. To the two folks who voted to close because "it was unclear what I was asking" - if you don't understand the question, maybe it's not in your field and you should rather choose "Skip"? Plenty of people understood what I was asking. – Dan Dascalescu Mar 09 '14 at 07:48
  • 4
    @DanDascalescu I vote to keep open, if only for your enlightening demonstrative comment. Finding this kind of thing puts a smile on my face and that's worth more to me than normalizing SO. – iwein Mar 09 '14 at 08:04
  • @iwein: you have restored my faith in StackOverflow, after [experiences like these](https://pinboard.in/u:dandv/t:stackoverflow/t:against). – Dan Dascalescu Mar 09 '14 at 08:20
  • @hivert & whoever else didn't understand the question: as you can see, there are plenty of comments and even answers from people who did understand it. I've further edited the question for clarity and precision. – Dan Dascalescu Mar 09 '14 at 10:01
  • 3
    I really think this should be reopened. This question can be answered with code and I posted some code as an answer. – nwk Mar 09 '14 at 12:30
  • @nwk: unfortunately those who get to decide to reopen or not, know much less than you do about the topic at hand. They just happen to have more points. That's just how StackOverflow works. – Dan Dascalescu Mar 09 '14 at 12:34
  • Code can be posted as answers, but this does not make it a programming question. It is a *design* question, as it leaves it open what should really be done. The question has now been edited to be somewhat more specific, partly on arbitrary grounds. But if the question is reformulated as a specific question, with a specification for what the program should do, it should be posted as a new question, tagged with the programming language(s) that would be used, and containing code written so far, with an explanation of why it’s unsatisfactory. – Jukka K. Korpela Mar 09 '14 at 12:50
  • 1
    @JukkaK.Korpela: so I should go to all the trouble to post a new question, only to have it closed again by you or some others, based on grounds you can always dig from your enormous rule book? No, thanks. – Dan Dascalescu Mar 09 '14 at 12:54
  • @JukkaK.Korpela: I don't know whether to agree in this case (I think there's a design _and_ a programming question in this post, with most of the design part being in the first line) but I see your point about design. What would be the right Stack Exchange site for this question as it stands and for questions like it: Programmers? – nwk Mar 09 '14 at 13:18
  • @nwk: I fail to see how this would be a design question. I'm asking about a character set. I've even added "JavaScript" as a tag to appease Jukka (whose Unicode work I greatly respect, BTW, and have been aware of since 2004), but the point is that *I think* we're looking for nothing more than a regexp character class. – Dan Dascalescu Mar 09 '14 at 13:25
  • @DanDascalescu: I should clarify: what sounds designy to me is the sentence ending with "how can a chat or forum software prevent that kind of annoyance?", not the rest of your question. – nwk Mar 09 '14 at 13:38
  • 14
    You cannot prevent Zalgo... Ḧ̛̪̠́̌ͦ̔̄̐̓͗ͭ̒̀͗́̚ͅE̻̪͇͓͓͖͕̖͓̘͚̰̺͔̻̬͙͑͂̑ͫͧ̊̏ͨ͛ͯ̅̋͑ͤͤ̅̒͘͞ͅͅ ̧̢̡̩̥̯̤͚̤͍͓͙̳̞̦̓̓̇ͧ̎̐̓ͤ̀͜ͅC̦̫̗̠̝̅̀ͨ̊̕͝͝ͅŌ̷̝̝̰̞͓͎̫̖͚̲̟̽ͫ́͛̋̍̒ͦ̊̂̈ͤ͆͒͞ͅṂ̴̠̠̜̣̹ͥ̓̇͐̇ͬͣ͆̆̈́̚͡͝Ē̵̳̞̝̙͕ͬ͒ͮ̀͑͊̎͑̔̀̕͜͞Ş̶̡̛̠̠͙̱̣̝͔̻̻̩̬ͮ͑̀̒͂̐̑̋̚͘ –  May 18 '15 at 00:00
  • 1
    In the context of an HTML page, a simpler solution than trying to filter out certain combining diacritical marks is to use the CSS property `overflow: hidden`. For example, if I inspect the `td.comment-text` elements on this page and add that style, they no longer visually overflow onto other comments. – Nathan Long Sep 27 '16 at 17:51

5 Answers

20

Assuming you're very serious about this and want a technical solution you could do as follows:

  1. Split the incoming text into smaller units (words or sentences);
  2. Render each unit on the server with your font of choice (with a huge line height and lots of space below the baseline where the Zalgo "noise" would go);
  3. Train a machine learning algorithm to judge if it looks too "dark" and "busy";
  4. If the algorithm's confidence is low defer to human moderators.

This could be fun to implement but in practice it would likely be better to go to step four straight away.

Edit: Here's a more practical, if blunt, solution in Python 2.7. Unicode characters classified as "Mark, nonspacing" and "Mark, enclosing" appear to be the main tools used to create the Zalgo effect. Unlike the above idea this won't try to determine the "aesthetics" of the text but will instead simply remove all such characters. (Needless to say, this will trash text in many, many languages. Read on for a better solution.) To filter out more character categories add them to ZALGO_CHAR_CATEGORIES.

#!/usr/bin/env python
import unicodedata
import codecs

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print ''.join([c for c in unicodedata.normalize('NFD', line) if unicodedata.category(c) not in ZALGO_CHAR_CATEGORIES]),

Example input:

1
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
2
H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡
3

Output:

1
How does Zalgo text work?
2
How does Zalgo text work?
3

Finally, if you're looking to detect, rather than unconditionally remove, Zalgo text, you could perform character frequency analysis. The program below does that for each line of the input file. The function is_zalgo calculates a "Zalgo score" for each word of the string it is given (the score is the number of potential Zalgo characters divided by the total number of characters in the word). It then checks whether the third quartile of the words' scores is greater than THRESHOLD. With THRESHOLD equal to 0.5, this means we're trying to detect whether roughly one out of every four words consists of more than 50% Zalgo characters. (The THRESHOLD of 0.5 was guessed and may require adjustment for real-world use.) This type of algorithm is probably the best in terms of payoff relative to coding effort.

#!/usr/bin/env python
from __future__ import division
import unicodedata
import codecs
import numpy

ZALGO_CHAR_CATEGORIES = ['Mn', 'Me']
THRESHOLD = 0.5
DEBUG = True

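def is_zalgo(s):
    """Return True if the third quartile of the per-word Zalgo scores exceeds THRESHOLD."""
    word_scores = []
    for word in s.split():
        cats = [unicodedata.category(c) for c in word]
        # Fraction of this word's characters that are potential Zalgo marks.
        score = sum([cats.count(banned) for banned in ZALGO_CHAR_CATEGORIES]) / len(word)
        word_scores.append(score)
    if not word_scores:
        # Empty or whitespace-only input: there is nothing to score.
        return False
    total_score = numpy.percentile(word_scores, 75)
    if DEBUG:
        print total_score
    return total_score > THRESHOLD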

with codecs.open("zalgo", 'r', 'utf-8') as infile:
    for line in infile:
        print is_zalgo(unicodedata.normalize('NFD', line)), "\t", line

Sample output:

0.911483990148
True    Señor, could you or your fiancé explain, H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡

0.333333333333
False   Příliš žluťoučký kůň úpěl ďábelské ódy.  
nwk
  • Appreciate the elaborate solution, but I was looking for a simple character range regular expression, or a library like [strip-combining-marks](https://github.com/mathiasbynens/strip-combining-marks). – Dan Dascalescu Mar 09 '14 at 09:56
  • 2
    I wasn't quite sure how serious you were about looking for a solution (i.e., if you wanted something that's fun to play with vs. something you could plug in a forum today). I implemented two more practical solutions in Python; it was a fun little bit of research to figure this stuff out. Since this question is on hold right now I can't add my code as a separate answer, so I added it here. – nwk Mar 09 '14 at 12:26
  • I have (professionally) come across international text VALIDLY containing characters belonging to the two character classes you are banning, and please be aware that a word in CJK easily consists of a SINGLE character (and also be aware that in several languages words may NOT be separated by non-word characters). – Walter Tross Mar 09 '14 at 14:42
  • @WalterTross: "Banned" is a misnomer in the case of the second code snippet because it doesn't actually ban those marks. I'll change that. – nwk Mar 09 '14 at 14:58
  • @DanDascalescu Given that Regex is one of the ways in which Zalgo texts were generated, I would advise against trying so... http://stackoverflow.com/a/1732454/1808494 – Aron Mar 09 '17 at 04:37
13

Make the box overflow:hidden. It doesn't actually disable Zalgo text, but it prevents it from damaging other comments.

.comment {
  /* the overflow: hidden is what prevents one comment's combining marks from affecting its siblings */
  overflow: hidden;
  /* the padding gives space for any legitimate combining marks */
  padding: 0.5em;
  /* the rest are just to visually divide the three comments */
  border: solid 1px #ccc;
  margin-top: -1px;
  margin-bottom: -1px;
}
<div class=comment>The below comment looks awful.</div>
<div class=comment>H̡̫̤ͭ̓̓̇͗̎̀ơ̯̗͒̄̀̈ͤ̀͡w͓̲͙͋ͬ̊ͦ̂̀̚ ͎͉͖̌ͯͅͅd̳̘̿̃̔̏ͣ͂̉̕ŏ̖̙͋ͤ̊͗̓͟͜e͈͕̯̮͌ͭ̍̐̃͒s͙͔̺͇̗̱̿̊̇͞ ̸ͩͩ͑̋̀ͮͥͦ̊Z̆̊͊҉҉̠̱̦̩͕ą̟̹͈̺̹̋̅ͯĺ̡̘̹̻̩̩͋͘g̪͚͗ͬ͒o̢̖͇̬͍͇̔͋͊̓ ̢͈͂ͣ̏̿͐͂ͯ͠t̛͓̖̻̲ͤ̈ͣ͝e͋̄ͬ̽͜҉͚̭͇ͅx̌ͤ̓̂̓͐͐́͋͡ț̗̹̄̌̀ͧͩ̕͢ ̮̗̩̳̱̾w͎̭̤̄͗ͭ̃͗ͮ̐o̢̯̻̾ͣͬ̽̔̍͟r̢̪͙͍̠̀ͅǩ̵̶̗̮̮ͪ́?̙͉̥̬ͤ̌͗ͩ̕͡</div>
<div class=comment>The above comment looks awful.</div>
notriddle
  • 1
    Highly practical suggestion. Validation measures such as `''.join((c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn'))` are resource intensive and the opposite of subtle. – Hassan Baig Mar 04 '18 at 02:47
  • I think you mean "awful". – S.S. Anne Jun 11 '19 at 17:52
6

A related question was asked before: https://stackoverflow.com/questions/5073191/how-is-zalgo-text-implemented but it's interesting to go into prevention here.

In terms of preventing this you can choose several strategies:

  1. prevent combining diacritics entirely (and piss off many international users),
  2. filter out combining characters using whitelisting or blacklisting (and piss off a smaller percentage of international users)
  3. prevent more than a certain number of combining characters (and piss off an even smaller percentage of users)
  4. have a healthy moderator community (with all the downsides that has, see your question as an example here)
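Strategy 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: "combining character" is taken to mean the Unicode general categories Mn and Me (the same two used in another answer on this page), the limit of 8 comes from the question, and the function name `cap_combining_marks` is made up for this example.

```python
import unicodedata

# Assumption: a "combining mark" is any character in general category
# Mn (Mark, nonspacing) or Me (Mark, enclosing).
MARK_CATEGORIES = {"Mn", "Me"}


def cap_combining_marks(text, max_marks=8):
    """Keep at most max_marks consecutive combining marks and drop the rest."""
    kept = []
    run = 0  # length of the current run of combining marks
    # Decompose first so precomposed letters such as "é" count as base + mark.
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) in MARK_CATEGORIES:
            run += 1
            if run > max_marks:
                continue  # excess mark in this run: skip it
        else:
            run = 0  # any non-mark character starts a fresh run
        kept.append(ch)
    # Recompose so ordinary accented letters end up as single code points again.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    zalgo = "e" + "\u0301" * 20
    print(len(cap_combining_marks(zalgo)) < len(zalgo))
```

Note that the round trip through NFD and NFC can itself change text that was not normalized to begin with, so whether that is acceptable depends on the application.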
Community
iwein
  • 4
    "with all the downsides that has, see your question as an example here" - priceless :) – Dan Dascalescu Mar 09 '14 at 08:26
  • The smallest unit of text that is usually zalgoed is a line. Rather than the absolute number of combining characters you could look at their density (percentage) in each line. – nwk Mar 09 '14 at 08:55
  • 1
    @nwk good trick, but I was thinking of disallowing more than a few successive combining characters (meaning you can only reach a certain height/depth) – iwein Mar 10 '14 at 13:13
4

You can get rid of Zalgo text in your application using strip-combining-marks by Mathias Bynens.

The module strip-combining-marks is available for browsers (via Bower) and Node.js applications (via npm).

Here is an example of how to use it with npm:

var stripCombiningMarks = require("strip-combining-marks");
var zalgoText = 'U̼̥̻̮͍͖n͠i͏c̯̮o̬̝̠͉̤d͖͟e̫̟̗͟ͅ';
var strippedText = stripCombiningMarks(zalgoText); // "Unicode"
Benny Code
  • 3
    For anyone coming here via Google, be aware that strip-combining-marks will trash some valid emojis. It turns out the blue and white number emojis use combining marks... https://emojipedia.org/keycap-digit-one/ – carpii Nov 26 '17 at 20:26
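One way around the keycap problem mentioned in the comment above is to strip marks while exempting a small allowlist. The sketch below is in Python rather than JavaScript and is not the strip-combining-marks API; the function name `strip_marks_keep_keycaps` and the allowlist `KEEP` are assumptions made for illustration, and a real deployment would probably need a longer allowlist.

```python
import unicodedata

# Assumption: these two code points are the only marks worth preserving.
# U+FE0F is VARIATION SELECTOR-16 and U+20E3 is COMBINING ENCLOSING KEYCAP;
# together they turn a digit into its keycap emoji.
KEEP = {"\ufe0f", "\u20e3"}


def strip_marks_keep_keycaps(text):
    """Remove Mn/Me characters except those explicitly allowlisted in KEEP."""
    return "".join(
        ch
        for ch in unicodedata.normalize("NFD", text)
        if ch in KEEP or unicodedata.category(ch) not in ("Mn", "Me")
    )


if __name__ == "__main__":
    keycap_one = "1\ufe0f\u20e3"
    print(strip_marks_keep_keycaps(keycap_one) == keycap_one)
```

An allowlisted mark could still be repeated many times by a determined user, so this would likely have to be combined with a limit on consecutive marks.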
2

Using PHP and the mindset of a demolition worker, you can get rid of the Zalgo with the iconv function. Of course, that will also kill every other character that ISO-8859-1 cannot represent, and the result is encoded as ISO-8859-1 rather than UTF-8.

$unZalgoText = iconv("UTF-8", "ISO-8859-1//IGNORE", $zalgoText);
solitud