
I have a slightly unusual profanity-related question.

Now we're used to dealing with profanity-filtering of user-generated content — any method is imperfect, but products like CleanSpeak and WebPurify do a good-enough job.

The problem we have at the moment, though, is that we've been building an engine to run promotional-code–based competitions, that will be used internationally. We could do with checking that none of these codes is profane in Latin American Spanish or Malay (at least in the first instance), to make sure we don't send out a code that's equivalent to FUCK23 or PEN15 or something.

We've tried Googling around and asking people we know, but we can't find an easy way of getting hold of an es-419 or an ms profanity list to filter the codes against. As there are literally millions of codes per locale, we'd rather do an offline check than hit an API for each code (which would be expensive both in terms of bandwidth and usage fees).

I know this is a bit of a long shot, but does anyone know of a good source for profanity lists in different languages?

#disclaim: We know that no profanity filtering is perfect, that it's essentially futile with user-generated content and we have read SO #273516: How do you implement a good profanity filter? — that's not what we're asking.

Owen Blacker
    Not helpful to you, but reminds me of this: http://thedailywtf.com/Articles/The-Automated-Curse-Generator.aspx – Ben Parsons Jan 13 '12 at 12:51
  • I'd not seen that story before. That is truly awesome; thank you for brightening up my lunch break :o) – Owen Blacker Jan 13 '12 at 13:08
  • The crucial sentence from the link that Ben gave you is: "I've been thinking about it and it's too dangerous to just have a bad-word filter. We'll never be able to think up every possible offensive-sounding combination." That's it. There is simply no way to filter profanity, especially when somebody writes it down in some special way. BTW, I wanted to share the same article, but Ben was faster. – Paweł Dyda Jan 13 '12 at 13:34
  • Do you actually believe one word from 'thedailywtf'? Ever since the ridiculous robot throwing objects, I've concluded that it's fictional. – bmargulies Jan 16 '12 at 22:50

2 Answers


I have had the same thoughts while trying to generate 6-character codes for a project I am working on. I decided to reduce the likelihood of obviously profane codes, so I removed the vowels that I found in as many "bad" words as I could think of from my initial base-36 generation code, leaving me with something more like a base-28 system that does not include a, e, i, o, u, 1 or 0. The one and zero were removed to reduce confusion with I, L and O in some fonts. So far I have not seen a "profane" code generated, and base 28 still gives roughly 480 million unique 6-character combinations. I cannot vouch for other languages, though, and had not even considered them...
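A minimal sketch of the approach described above, assuming a 6-character code drawn from a base-36 alphabet with the vowels and the confusable digits 0/1 stripped out (the exact set of removed characters is an assumption; the answer does not list them all):

```python
import secrets
import string

# Start from the base-36 alphabet (0-9, A-Z), then drop the vowels
# (to avoid spelling words) and '0'/'1' (easily confused with O/I/L
# in some fonts). The exact removals here are an illustrative guess.
VOWELS = set("AEIOU")
CONFUSABLE = set("01")
ALPHABET = [c for c in string.digits + string.ascii_uppercase
            if c not in VOWELS and c not in CONFUSABLE]

def generate_code(length=6):
    """Return a random promotional code from the reduced alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

code = generate_code()
```

Using `secrets` rather than `random` matters here: promotional codes are security-sensitive (they can be guessed and redeemed), so a cryptographically strong source is the safer default.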

  • Yeah, I'm pretty sure that's roughly what we ended up doing. I'm catching up with the developer who was working on it next week, so I'll be able to post more details then, I would hope. And welcome to Stack Overflow, Grant :o) – Owen Blacker Apr 26 '12 at 11:35

Building or finding lists in other languages is extremely time-consuming and difficult (trust me, we've built many of them at Inversoft). You might be better off tweaking the code generators instead (from what I could tell, your codes are generated programmatically rather than written by humans).

The best way to tweak a generator is to ensure that the codes can't easily form words based on the general use of consonants and vowels in most European languages. Things get a bit dicey in Polish and others, but it usually works.

Generally, most codes that start with a vowel are followed by another vowel or a non-joining consonant (like 'q' without a 'u'). If the code starts with a consonant then the next character is the same consonant or one that has a low probability of being used. For example, if you start with 's' then adding 'g' is a good choice.
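One way to apply the pairing rule above is as a rejection filter rather than inside the generator itself: discard any candidate code containing a vowel directly after a consonant, since that is the pattern that most easily forms a pronounceable syllable. This is a sketch of that idea, not the answerer's actual implementation:

```python
import re

# A consonant immediately followed by a vowel is the building block of
# pronounceable syllables (e.g. "FU" in FUCK23), so we reject codes
# that contain that pattern anywhere.
VOWEL_AFTER_CONSONANT = re.compile(r"[B-DF-HJ-NP-TV-Z][AEIOU]")

def is_pronounceable(code):
    """True if the code contains a consonant-vowel pair."""
    return bool(VOWEL_AFTER_CONSONANT.search(code.upper()))
```

In a generate-and-test loop you would simply re-draw whenever `is_pronounceable` returns `True`; with a filter this strict, only a small fraction of candidates is discarded.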

You could also use Wiktionary or other similar sources (like Linux dictionary files) to build a statistical approach to this. By extracting the probability of characters appearing next to each other, you should be able to generate codes with a high probability of never forming words in any language.
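The statistical idea above can be sketched as a letter-bigram model: count adjacent-letter pairs in a per-locale word list, then flag any code whose letter bigrams are all common in real words. The word list and threshold below are assumptions for illustration; in practice you would feed in a full dictionary for each target locale (e.g. a Wiktionary dump or a Linux dictionary file):

```python
from collections import Counter

def bigram_counts(words):
    """Count adjacent letter pairs across a word list (case-folded)."""
    counts = Counter()
    for word in words:
        word = word.upper()
        for a, b in zip(word, word[1:]):
            if a.isalpha() and b.isalpha():
                counts[a + b] += 1
    return counts

def looks_wordlike(code, counts, threshold=3):
    """True if every letter bigram in the code is common in the corpus.

    Digits break up bigrams, so only letter-letter pairs are checked;
    the threshold is an arbitrary illustrative cut-off.
    """
    code = code.upper()
    pairs = [code[i:i + 2] for i in range(len(code) - 1)
             if code[i].isalpha() and code[i + 1].isalpha()]
    return bool(pairs) and all(counts[p] >= threshold for p in pairs)

# Toy corpus for illustration only -- use a real dictionary per locale.
counts = bigram_counts(["pen", "penny", "pencil", "pent"])
```

Against this toy corpus, "PEN15" is flagged (the bigrams PE and EN are common) while "XQZV7" is not. A fuller version would use smoothed bigram log-probabilities rather than raw counts, but the shape of the check is the same.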

However, if I misread your question and you aren't generating the codes programmatically, you can ignore my response completely. :)

voidmain