I need a regex pattern that detects whether a given text is in English, with the following requirements:

  • Allowing spaces
  • Allowing numbers and words
  • Allowing multiple lines and tabs
  • Allowing all special characters !@#$%^&*()_-+={}|/<>~`':";[]
  • Allowing URLs, emails
  • If the text contains any non-English character, it should be considered non-English. This applies to Arabic letters/words like "ا ب ت ... etc.", to French characters like "é, â ... etc.", and to all other languages

In brief, I need to know whether any given text, in any format, is in English. I have tried many patterns without success, and I don't want to use a language detector because the application will be used offline.

Sample texts that should not be accepted:

Hello! ... é

مرحبا بك

للتحميل اضغط هنا ... http://www.google.com

So, if the text contains any non-English letter, it should be considered non-English text.

Ahmed Negm
  • BTW, I tried the following patterns: "\p{IsArabic}", "^[a-zA-Z0-9&.\:/-]+$", "^[\x20-\x7E]+$", "[A-Za-z0-9 .,-=+(){}!@#$%^&*_[\]\\]" ... but all of them give me incorrect results. – Ahmed Negm Jun 03 '17 at 23:29
  • café is an English word, though, and many languages have texts in characters that are also used in English… anyway, look into Unicode categories. You can check for letter characters that aren’t a-z. – Ry- Jun 03 '17 at 23:30
  • You are asking waaaaay too much from regex and somewhat simplifying the detection of a language. ***This is not what regex is for.*** Really. Why not just load an [English word list](http://www-01.sil.org/linguistics/wordlists/english/wordlist/wordsEn.txt) and compare how many words in your text are a match? – spender Jun 03 '17 at 23:35
  • @spender That would take a long time to validate, and what if the text is a link, a GUID, or anything else that is written in English characters but is not a valid word? – Ahmed Negm Jun 03 '17 at 23:41
  • What if the word is English, but [doesn't "look" like it's English](https://en.wiktionary.org/wiki/Appendix:English_words_with_diacritics)? BTW, a GUID is not English. Did you mean something like "Is the string representable in ASCII"? – spender Jun 03 '17 at 23:43
  • Yes ... like shortened URLs, for example, and the same for hexadecimal text. – Ahmed Negm Jun 03 '17 at 23:49
  • This is an [XY Problem](https://meta.stackexchange.com/questions/66377/); you need X and you thought “I know! A regex pattern which can detect if the given text is in English will get me X!” But you tried and that doesn't get you X. We can't give you directions until you tell us your destination, and “a regex pattern” is a direction, not a destination. – Dour High Arch Jun 03 '17 at 23:50
  • "A friend of mine, Jürgen, was visiting from Germany" - please let me know if this sentence is English or not? – Enigmativity Jun 04 '17 at 00:31
  • This sounds like a machine learning problem. Naïve (would you consider that an English word?) classification of languages by simple rules and pattern matching is doomed to failure. – Rook Jun 04 '17 at 08:37
  • @Enigmativity it should be considered non-English – Ahmed Negm Jun 04 '17 at 09:10
  • I found this [tool](http://kourge.net/projects/regexp-unicode-block), and I think it is better to check for the Unicode characters used in English; if the text contains anything else, it should be treated as non-English.
    But I need help getting these Unicode patterns.
    – Ahmed Negm Jun 04 '17 at 09:38
  • @AhmedNegm - But it's clearly English... – Enigmativity Jun 04 '17 at 09:57
  • @Enigmativity I have to treat it as non-English because of the "ü" character. I am dealing with a mobile operator that requires me to provide text in a plain format only if the text is written purely in the English alphabet, including special characters. – Ahmed Negm Jun 04 '17 at 10:04
  • @AhmedNegm - What if I wrote it as "A friend of mine, Juergen, was visiting from Germany"? – Enigmativity Jun 04 '17 at 11:10
  • @Enigmativity, now it is in English. I tried "^[\u0000-\u007F]+$" and I think it fits my needs so far. – Ahmed Negm Jun 04 '17 at 11:14
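
Ry-'s suggestion in the comments (look for letter characters that aren't a-z) can be sketched as follows. This is a minimal illustration in Python, not something from the thread: the function name `non_english_letters` is hypothetical, and the text is first normalized to NFC on the assumption that a base letter followed by a combining accent should count as one accented letter.

```python
import unicodedata

def non_english_letters(text: str) -> list[str]:
    """Return every letter in `text` that is not an ASCII letter (a-z, A-Z)."""
    # Assumption: compose "e" + combining accent into a single "é" first,
    # so decomposed input is judged the same way as precomposed input.
    composed = unicodedata.normalize("NFC", text)
    return [
        ch for ch in composed
        # Unicode general categories Lu, Ll, Lt, Lm, Lo all start with "L".
        if unicodedata.category(ch).startswith("L") and not ch.isascii()
    ]

print(non_english_letters("A friend of mine, Jürgen, was visiting from Germany"))
print(non_english_letters("A friend of mine, Juergen, was visiting from Germany"))
```

Note that this check only looks at letters, so non-ASCII punctuation or symbols would pass; it is looser than a pure ASCII test.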

3 Answers

I think I found it. I tried the Basic Latin Unicode block, and it works fine so far. I used:

"^[\u0000-\u007F]+$"

The idea is to check that the text is written using English letters only; special characters are also allowed. So a text such as "I met my friend in a café" is considered non-English, because the text must contain only English letters and no others, even in the name of a person, a place, etc. This is exactly what I need.
Thank you all.
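
For readers who want to try the same range outside the original environment, here is a minimal sketch in Python (an assumption; the thread does not name a language). The names `BASIC_LATIN` and `is_basic_latin` are hypothetical. It uses `fullmatch` instead of `^...$` so that the whole string, including any trailing newline, must fall inside the range.

```python
import re

# Basic Latin block: U+0000 through U+007F, the same range as the pattern above.
BASIC_LATIN = re.compile(r"[\x00-\x7F]+")

def is_basic_latin(text: str) -> bool:
    """Return True when a non-empty text contains only U+0000-U+007F characters."""
    return BASIC_LATIN.fullmatch(text) is not None

samples = [
    "Hello! ... é",
    "مرحبا بك",
    "للتحميل اضغط هنا ... http://www.google.com",
    "Line one\n\tline two: user@example.com, http://www.google.com",
]
for sample in samples:
    print(is_basic_latin(sample), repr(sample))
```

Because of the `+` quantifier, an empty string is rejected; whether that is desirable depends on the application.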

Ahmed Negm

This should work:

@"[^\ta-zA-Z0-9\s!""#$-/:-@\[\\\]^_`{-~]+"

If there is a match, the text contains non-English letters or characters.

BTW, you are only testing whether the text contains just the characters an English-speaking person would normally use, NOT which language it is in. To detect a language you need something like natural language processing, NOT regex.
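
The "a match means non-English characters are present" idea can also be used to report where the offending characters are. The sketch below is in Python (an assumption, not the answerer's code) and swaps the explicit character list for a simpler negated ASCII range; `NON_ASCII_RUN` and `find_non_ascii` are hypothetical names.

```python
import re

# Any run of characters outside U+0000-U+007F counts as "non-English" here.
NON_ASCII_RUN = re.compile(r"[^\x00-\x7F]+")

def find_non_ascii(text: str) -> list[tuple[int, str]]:
    """Return (start_index, run) for each run of non-ASCII characters."""
    return [(m.start(), m.group()) for m in NON_ASCII_RUN.finditer(text)]

hits = find_non_ascii("Hello! ... é")
print(hits)        # [(11, 'é')]
print(bool(hits))  # True: at least one non-ASCII character was found
```

An empty result means no character outside the range was found, which says nothing about the language of the text itself.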

ed9w2in6

In theory it is possible, if the regex contained every word from an English dictionary.

You can create a regex that detects non-English characters. That will detect text that is definitely not English, but won't be able to confirm it definitely is.
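
The one-sided nature of such a check can be illustrated with a short sketch, written in Python as an assumption; `definitely_not_english` is a hypothetical name, and "non-English character" is taken to mean anything outside U+0000-U+007F.

```python
def definitely_not_english(text: str) -> bool:
    """True when the text contains a character outside U+0000-U+007F."""
    return any(ord(ch) > 0x7F for ch in text)

# A True result rules English out (under the ASCII-only definition).
print(definitely_not_english("Hello! ... é"))

# A False result proves nothing: this French sentence is plain ASCII.
print(definitely_not_english("Bonjour tout le monde"))
```

In other words, the check can reject text, but a pass only means "no disallowed characters", not "this is English".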

ya23