0

I have a kind of Q&A site (very approximately) where users enter questions to be answered by our staff. I am quite concerned about users posting non-questions, which are an annoyance. The best I have thought of so far is a system that detects whether the text is in Italian (our users' language) and, if it is, checks it against a list of common copypastas to make sure it isn't one of them.

So, long story short: users will input some text, I have to make sure it's a proper question in Italian and not random characters.

Giulio Muscarello
  • 1. Which platform/programming language are you using? 2. Is there any possibility of giving them a predefined questionnaire? E.g., drop-down lists they choose from (category of question (travel/books/food), region, **language**); if all of those are fine, you then let the user enter the final, length-limited question text in their chosen language. That way, the question is more organized and filtered before it is sent to your staff... – bonCodigo Jan 05 '13 at 18:24
  • It's a really simple idea, but you can try to check if 30% (or some other value) of the words are from an Italian dictionary. Maybe it will be enough. – zch Jan 05 '13 at 18:26
  • I guess you don't mean spam, but just irrelevant questions being posted by some users? In that case I think it will be very difficult to detect whether submitted text is relevant or not. This is the reason for having a dedicated forum moderator watching the forum, or a system such as Stack Overflow's, where forum users themselves can vote to close a question. If you are bothered by spam, I would recommend using a CAPTCHA to ensure the question is posted by a human being. Sorry if I've misunderstood your question. – Ismar Slomic Jan 05 '13 at 18:36
  • @BonCodigo 1. PHP 2. I'm thinking of doing some kind of "filtering" by topic (i.e., "Is it a problem with the website? Do you have a question about messaging?"), but I don't see how it can help with managing spam messages. – Giulio Muscarello Jan 06 '13 at 15:23
  • @zch That's quite simple, yet it might be effective, as I aim to filter messages like "jjohujjoihjkiuihugyhbihub", not actual spam (i.e. "Visit http://www.stackoverflow.com/ to meet hot women near you"). +1. – Giulio Muscarello Jan 06 '13 at 15:29
  • @IsmarSlomic I think you might have misunderstood what I meant by "irrelevant". In my previous comment I explained it better: I just want to filter random characters, not commercial spam. Having moderators would surely work, but I think there are automated approaches with lower costs [actually, a trade-off between computational costs and economic ones]. – Giulio Muscarello Jan 06 '13 at 15:32
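
The dictionary-ratio idea from the comments can be sketched roughly as follows. This is only an illustration under stated assumptions: `ITALIAN_WORDS` is a tiny hypothetical stand-in for a real Italian word list, the function names are made up, and the 30% threshold is the value suggested in the comment, not a tuned one. Python is used purely for illustration; the same logic should port to PHP.

```python
import re

# Hypothetical, tiny stand-in for a real Italian word list (assumption).
# In practice this would be loaded from a full dictionary file.
ITALIAN_WORDS = {
    "come", "posso", "cambiare", "la", "mia", "password", "il", "un", "una",
    "di", "che", "non", "per", "è", "sito", "messaggio", "domanda", "ho",
    "problema", "con",
}


def italian_word_ratio(text, dictionary=ITALIAN_WORDS):
    """Return the fraction of words in `text` that appear in `dictionary`."""
    # Runs of Unicode letters only: digits, punctuation and underscores split words.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words:
        return 0.0
    known = sum(1 for word in words if word in dictionary)
    return known / len(words)


def looks_italian(text, threshold=0.3):
    """Accept the text if at least `threshold` of its words are known Italian words."""
    return italian_word_ratio(text) >= threshold
```

With a check like this, a keyboard-mash string such as "jjohujjoihjkiuihugyhbihub" counts as a single unknown word and falls below any positive threshold, while very short real questions depend heavily on how complete the word list is.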

4 Answers

1

Not sure what language you'll write it in, but these might help:

http://www.easywayserver.com/blog/java-string-contains-example/

How do I check if a string contains a specific word in PHP?

Checking if the input String (Question) contains any forbidden word would be one way to go at it.

Pseudo code

ListOfForbiddenWords;
if Language == Italian
    if Input does not contain any of ListOfForbiddenWords
        // It's fine: accept the question
    else
        // Reject: forbidden word found (don't spam)
else
    // Reject: the text is not Italian

I'm not quite sure what the best way is to check whether a string is written in a specific language.
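
A rough, runnable take on the pseudocode above might look like this. It is a sketch, not a definitive implementation: `KNOWN_COPYPASTAS` and its entries are hypothetical examples, the function names are invented for illustration, and the language check is deliberately left as a caller-supplied function (`is_italian`) because this answer does not say how to implement it. Python is used only for illustration; the logic should translate to PHP directly.

```python
import re

# Hypothetical examples of known copypastas (assumption, not real data).
KNOWN_COPYPASTAS = [
    "invia questo messaggio a 10 amici",
    "copia e incolla questo testo sul tuo profilo",
]


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def contains_copypasta(text, copypastas=KNOWN_COPYPASTAS):
    """True if any known copypasta appears inside the normalized text."""
    normalized = normalize(text)
    return any(normalize(entry) in normalized for entry in copypastas)


def accept_question(text, is_italian):
    """Mirror the pseudocode: language check first, then the forbidden list.

    `is_italian` is a caller-supplied function (assumption) that returns
    True when the text is judged to be Italian.
    """
    if not is_italian(text):
        return "rejected: not Italian"
    if contains_copypasta(text):
        return "rejected: copypasta"
    return "accepted"
```

Normalizing both sides before comparing means small changes in case, punctuation, or spacing do not let a known copypasta slip through, although reworded copies would still pass.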

Floris Velleman
  • Interesting how the question said, "How to detect if a text is in a given language?", and you just write "If Language = Italian". Also, this doesn't fit the requirements: "Some text, I have to make sure it's [...] in Italian *and not random characters*." I think this kind of approach (check against a list of forbidden words) would let the message "jgujqkwfjpihoujlkfa" pass, while it shouldn't. – Giulio Muscarello Jan 06 '13 at 15:37
0

You can use Rosoka's language detection if you want a commercial option. You can try it out at Rosoka Cloud for about $1/hour with all of the features. The language ID is available as a standalone library, so you can feed it example inputs that you are concerned about to see if it gives back what you want.

Random text like "jgujqkwfjpihoujlkfa" will be flagged as ROMANIZATION, or, if it is non-ASCII, with a tag based on the underlying code blocks that were used. In other words, input that is not a language will not be tagged as a language.

mike
0

There are many free language detection libraries. One popular example is libexttextcat from LibreOffice. There are many clones and ports and variants if you don't want a C library; see e.g. http://odur.let.rug.nl/vannoord/TextCat/competitors.html for an (incomplete, slightly dated) list of pointers.
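
To give a feel for how TextCat-style detectors work, here is a minimal sketch of character n-gram rank profiles compared with an "out-of-place" distance. It is not libexttextcat's API or implementation: the function names, the two one-sentence training samples, and the profile size are all assumptions made for illustration, and real libraries ship profiles trained on far more text.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map the most frequent character n-grams (n = 1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a max penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def detect(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc_profile = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]))


# Toy training samples (assumption): far too small for real use.
SAMPLES = {
    "it": "come posso cambiare la mia password sul sito e dove trovo le mie domande",
    "en": "how can i change my password on the site and where do i find my questions",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

Note that `detect` always returns the nearest language, even for gibberish, so rejecting random characters would additionally need a cut-off on the distance; where to put that cut-off is something to tune on real inputs.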

tripleee
-1

A similar question was asked here a while ago, and the answers listed a number of language detection API solutions. One of the answers points to detectlanguage.com, which offers a limited free language detection service.

SDR