I have some problems with Zalgo on my imageboard.

Texts like below mess up my imageboard. Is there a way to prevent these characters and "fix" or clean up the texts?

Example text (source):

ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ

I tried to use this solution:

$cleanMessage = preg_replace("/[^\x20-\xAD\x7F]/", "", $input_lines);

Taken from here: Remove special characters that mess with formating. But it works only for Latin characters. Can anyone help me?
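
A likely reason the pattern above only copes with Latin text is that it has no u modifier, so PCRE compares single bytes rather than Unicode code points. The sketch below (assuming PHP 7 or later and a UTF-8 source file; the Cyrillic sample word is an illustration, not from the original post) shows what is left of a non-Latin string:

```php
<?php
// Without the u modifier the character class is applied byte by byte.
$input = "Привет"; // Cyrillic sample: every letter is two UTF-8 bytes

$cleaned = preg_replace("/[^\x20-\xAD\x7F]/", "", $input);

// The lead bytes 0xD0 and 0xD1 lie outside \x20-\xAD and are dropped,
// while some continuation bytes (0x80-0xAD) survive on their own.
var_dump(bin2hex($cleaned));           // string(6) "9f8082"

// preg_match with an empty u-pattern returns false for invalid UTF-8.
var_dump(preg_match('//u', $cleaned)); // bool(false)
```

The result is no longer valid UTF-8, which is why non-Latin posts come out broken rather than cleaned.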

aftamat4ik
  • Please convert this into a proper question and answer if the goal is simply to share something with the community. – PeeHaa Oct 03 '15 at 11:02
  • Edit your question to turn it into an actual question. Provide all information, e.g. sample Zalgo texts, in the question. And explain what you are trying to do (e.g. stripping characters instead of replacing them). Then take your solution, add some extra explanation about what it does, and post that as an answer below. – PeeHaa Oct 03 '15 at 11:06
  • Please mark my answer as valid if you can and close the other comments. I cannot do this myself. – aftamat4ik Oct 03 '15 at 11:15
  • You can only mark your answer as "valid" after some time, to allow others to both review it and possibly share their own solutions. – PeeHaa Oct 03 '15 at 11:17
  • And here's me thinking Stackoverflow's CSS was playing up all of a sudden! – user3791372 Oct 03 '15 at 11:28

1 Answer

This regular expression removes every combining mark (Unicode category M) from the $text variable:

$text = preg_replace("~[\p{M}]~uis","", $text);

If $text contains a character with a combining mark, for example กิ, this regex removes that mark and the resulting $text contains just ก.
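
As a minimal sketch of that behaviour (assuming PHP 7 or later for the \u{...} escapes; stripMarks is a hypothetical helper name, and the pattern is written without the character class and the extra modifiers, which should not change what it matches):

```php
<?php
// Hypothetical helper: removes every code point in Unicode category M.
function stripMarks(string $s): string
{
    return preg_replace('~\p{M}~u', '', $s);
}

// Thai KO KAI (U+0E01) followed by the vowel sign SARA I (U+0E34).
$thai = "\u{0E01}\u{0E34}";
// "e" followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
$decomposed = "e\u{0301}";
// U+00E9, the precomposed letter: one code point, no separate mark.
$precomposed = "\u{00E9}";

var_dump(stripMarks($thai) === "\u{0E01}");        // bool(true): vowel sign removed
var_dump(stripMarks($decomposed) === "e");         // bool(true): accent removed
var_dump(stripMarks($precomposed) === "\u{00E9}"); // bool(true): nothing to remove
```

Note that the pattern cannot tell a legitimate mark from a Zalgo one: the Thai vowel sign and the decomposed accent are removed just like stacked junk, and only precomposed letters pass through unchanged.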

I then improved this regex so that it filters only the second level of marks:

$text = preg_replace("~(?:[\p{M}]{1})([\p{M}])+?~uis","", $text);

This regex only matches marks that come in consecutive pairs, so a character carrying a single mark is left untouched. Use it if you want to keep German or other languages that rely on such marks. It transforms this word:

͐̈ͩ̎Zͮ͌ͦ͆ͦͤÃ̉͛̄ͭ̈̚LͫG̉̋͂̉Oͨ͌̋͗!

into this: ZÄLͫGO!
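
One detail worth checking before relying on the second pattern: it deletes marks two at a time, so a run with an odd number of marks keeps its last mark rather than its first. The sketch below (assuming PHP 7 or later; both function names are hypothetical) compares it with a variant that keeps the first mark of every run, which may or may not be what you want:

```php
<?php
// The second pattern from the answer, unchanged: removes marks in pairs.
function stripMarkPairs(string $s): string
{
    return preg_replace('~(?:[\p{M}]{1})([\p{M}])+?~uis', '', $s);
}

// Hypothetical variant: keep the first mark of each run, drop the rest.
function keepFirstMark(string $s): string
{
    return preg_replace('~(\p{M})\p{M}+~u', '$1', $s);
}

$one   = "a\u{0308}";                  // one mark (diaeresis)
$two   = "a\u{0308}\u{0301}";          // two marks
$three = "a\u{0308}\u{0301}\u{0303}";  // three marks

var_dump(stripMarkPairs($one) === $one);            // bool(true): single mark kept
var_dump(stripMarkPairs($two) === "a");             // bool(true): both marks removed
var_dump(stripMarkPairs($three) === "a\u{0303}");   // bool(true): only the last mark is left

var_dump(keepFirstMark($one) === $one);             // bool(true)
var_dump(keepFirstMark($two) === "a\u{0308}");      // bool(true)
var_dump(keepFirstMark($three) === "a\u{0308}");    // bool(true)
```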

I hope the second regex helps.

aftamat4ik
  • With `\p{M}` you're not only removing Zalgo, you're also removing all characters that fall under the [Unicode General Category `Mark`](http://unicode.org/reports/tr44/#General_Category_Values). With this method you're also removing important codes, such as diacritics commonly used in Latin languages. Also, there's no need for a character class, nor the `i` or `s` modifier in the pattern. – Mariano Oct 03 '15 at 11:24
  • yes, but this is the only way to stop the ... hm ... trash in comments. – aftamat4ik Oct 03 '15 at 11:31
  • There are lots of languages with characters with one phonetic mark, a couple with two (ancient Greek comes to mind), and perhaps a few with three (Vietnamese?). Any sequence longer than 3 is *definitely* suspect. You could only remove those. – Jongware Oct 04 '15 at 09:17
  • OK. I changed my regex so that it no longer filters the first level of phonetic marks: `$text = preg_replace("~(?:[\p{M}]{1})([\p{M}])~uis","", $text);` – aftamat4ik Oct 05 '15 at 15:03
  • Is there any way I could get this working on Python? – jlxip Aug 02 '18 at 16:29
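
The suggestion in the comments to treat only runs of more than three marks as suspect could be sketched in PHP roughly as follows. This is one possible reading, not the commenter's own code: limitMarks and its $max parameter are hypothetical names, it assumes PHP 7 or later, and it keeps the first $max marks of a run and drops the excess rather than removing the whole run.

```php
<?php
// Hypothetical helper: allow at most $max combining marks in a row.
function limitMarks(string $text, int $max = 3): string
{
    $max = max(0, $max); // a negative limit would produce an invalid pattern
    // Capture the first $max marks of a run, then match any further marks.
    $pattern = '~(\p{M}{' . $max . '})\p{M}+~u';
    // Put back only the captured marks; the excess is discarded.
    return preg_replace($pattern, '$1', $text);
}

// "o" carrying five stacked marks, Zalgo style.
$zalgo = "o\u{0300}\u{0301}\u{0302}\u{0303}\u{0304}";
// Decomposed Vietnamese letter: e + dot below + circumflex (two marks).
$vietnamese = "e\u{0323}\u{0302}";

var_dump(limitMarks($zalgo) === "o\u{0300}\u{0301}\u{0302}"); // bool(true): excess dropped
var_dump(limitMarks($vietnamese) === $vietnamese);            // bool(true): untouched
```

The threshold of three comes from the comment's estimate of how many marks real languages stack; it is worth checking against the languages your board actually sees.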