0

How can I find all words that contain at least one non-Latin letter (Arabic, Chinese, ...) using the regex.h library? For example:

cityدبي

James Webster
xXx_CodeMonkey_xXx
  • See http://stackoverflow.com/questions/2124010/grep-regex-to-match-non-ascii-characters and its answer, hope it helps. – Paolo Stefan Sep 19 '12 at 07:49

3 Answers

2

How about:

(?=\pL)(?![a-zA-Z])

This matches at any letter, in any alphabet, that is not a Latin letter (a-z, A-Z). Note that \pL and lookaheads are PCRE syntax, which POSIX regex.h does not support:

not ok - cityدبي
ok - city
not ok - دبي
Toto
  • 3
    Won't this match latin characters with accents, like ąęśćżół? I have to say that I'm quite irritated by how some English-speaking people apparently treat accented latin letters as second-class. – Jan Warchoł Jul 02 '14 at 11:37
0

Try this:

[a-zA-Z]*[^A-Za-z \d]+[a-zA-Z]*

Meaning: one or more non-Latin characters, optionally preceded or followed by Latin letters, i.e. a word containing at least one non-Latin character. See a demo with some random text: http://regexr.com?326s3

You may need to adjust this regex to your needs and account for things like digits, special characters, and word boundaries, depending on your input.
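As a rough sketch of how this pattern could be used from regex.h: POSIX extended regular expressions have no \d shorthand, so the version below substitutes [:digit:] inside the bracket expression. The helper name has_non_latin is hypothetical, and the code assumes the default "C" locale (no setlocale call), where each byte of a UTF-8 encoded word is treated as a separate character.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if `word` contains at least one character
 * outside [A-Za-z0-9 ], 0 if it does not, and -1 if the pattern fails to
 * compile. Assumes the default "C" locale, so non-ASCII text is seen as
 * individual bytes. */
int has_non_latin(const char *word)
{
    regex_t re;
    int rc;

    assert(word != NULL);

    /* POSIX ERE has no \d, so [:digit:] stands in for it here. */
    if (regcomp(&re, "[a-zA-Z]*[^A-Za-z [:digit:]]+[a-zA-Z]*",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, word, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling has_non_latin("city") should report no match, while a word such as cityدبي should match. Punctuation also counts as "non-Latin" here, so a string like "city." matches as well; that is the kind of adjustment mentioned above.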

DhruvPathak
-1

Just use [^a-zA-Z]: if it matches, the word should contain an international character...

frogwang
  • 2
    -1 or a space, or a period ... Depends on the locale and encoding anyway. – tripleee Sep 19 '12 at 08:25
  • @tripleee I think regex.h supports Unicode. And I don't think we need a more complicated regular expression to distinguish a pure Latin word... – frogwang Sep 19 '12 at 08:39
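
One possible way to reconcile the two comments above is to split the text into words first, so that spaces and periods never reach the regex. The sketch below is only an illustration: count_non_latin_words and its delimiter list are hypothetical, and it assumes UTF-8 input processed in the default "C" locale, where every non-ASCII character shows up as bytes outside a-z, A-Z and 0-9.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the words in `text` that contain at least one
 * byte outside A-Z, a-z and 0-9. Words are split on ASCII whitespace and a
 * few common punctuation marks, so those no longer trigger a match.
 * Returns -1 on error. */
int count_non_latin_words(const char *text)
{
    static const char delims[] = " \t\r\n.,;:!?";
    regex_t re;
    char *copy, *p;
    int count = 0;

    assert(text != NULL);

    if (regcomp(&re, "[^a-zA-Z0-9]", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    /* Work on a private copy so words can be NUL-terminated in place. */
    copy = malloc(strlen(text) + 1);
    if (copy == NULL) {
        regfree(&re);
        return -1;
    }
    strcpy(copy, text);

    p = copy;
    while (*p != '\0') {
        size_t len;
        char saved;

        p += strspn(p, delims);       /* skip leading delimiters */
        len = strcspn(p, delims);     /* length of the next word */
        if (len == 0)
            break;

        saved = p[len];
        p[len] = '\0';                /* temporarily terminate the word */
        if (regexec(&re, p, 0, NULL, 0) == 0)
            count++;
        p[len] = saved;               /* restore the delimiter */
        p += len;
    }

    free(copy);
    regfree(&re);
    return count;
}
```

Accented Latin letters such as ą or ę are also non-ASCII in UTF-8, so this sketch counts words containing them too. Telling those apart from Arabic or Chinese letters would need real Unicode property support, which plain regex.h does not offer.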