Multilanguage string regex

Question

I have a string that contains characters from different languages, like:

en <chars in english> fr <chars in french> es <chars in spanish>

I need to extract just the substring in a specific language from the string above. How can I do that using regex or some other tool in Python 2.6?

P.S. It could be in a different order, like: `en (.*) es (.*) it (.*)`. The problem is that `es`, `fr` or `it` is not in the Latin charset, which is why a regular regex is not working correctly with it.

Do you have a reliable structure like `[english word(s)] - [spanish word(s)] - ...`, or do you have to guess the languages? That would be a hard task. — Jasper, Nov 27 '16 at 17:42
Can you post an example with the expected result ? It's not really clear ... — Loïc G., Nov 27 '16 at 17:44
The structure is above; the characters of each new language start after: english français español — swserg, Nov 27 '16 at 17:46
What about this kind of regex : `r"english (.*) français (.*) español (.*)"` ? — Loïc G., Nov 27 '16 at 17:49
It could be in a different order, like: `en (.*) es (.*) it (.*)`. The problem is that `es` or `fr` is not in the Latin charset, which is why a regular regex is not working correctly with it. — swserg, Nov 27 '16 at 17:54
Check this: http://stackoverflow.com/questions/2371780/can-regular-expressions-work-with-different-languages — Mohammad Yusuf, Nov 27 '16 at 17:55
You are working in python 2.6... is the string a proper unicode string? `u"en fr es "`? — tdelaney, Nov 27 '16 at 18:03
One of the advantages of posting code is we can see where any mistakes were made. Regex does work with unicode... although you can have problems if you mistakenly have encoded unicode in a regular `str`. — tdelaney, Nov 27 '16 at 18:20

score 2 · Answer 1 · answered Nov 27 '16 at 18:22

Regex works with unicode, and you have several options for dicing up your strings. Here is an example where the string is split on language-code boundaries such as "en" and "es" and the pieces are put in a list. Then it's a matter of iterating over the list and finding the language you want.

>>> import re
>>> text = u"en <chars in english> fr <chars in french> es <chars in spanish>"
>>> languages = set((u'en', u'fr', u'es'))
>>> re_languages = u'|'.join(languages)
>>> splitter = re.compile(ur'\b({0})\b'.format(re_languages))  # {0}, not {}: Python 2.6 has no auto-numbered fields
>>> splitter.split(text)
[u'', u'en', u' <chars in english> ', u'fr', u' <chars in french> ', u'es', u' <chars in spanish>']

>>> parts = splitter.split(text)[1:]  # drop the empty string before the first code
>>> for i in range(0, len(parts), 2):  # parts alternates code, text, code, text, ...
...     if parts[i] == u'es':
...         print parts[i+1]
... 
 <chars in spanish>
>>>
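
As a rough sketch, the split approach above could be wrapped in a helper that returns a dictionary keyed by language code. The function name `split_by_language` and the sample text (with a Russian section standing in for non-Latin input) are made up for illustration and are not part of the original answer. The sketch avoids the `ur''` prefix and `print` so that it should run on both Python 2.6 and Python 3.

```python
import re

def split_by_language(text, codes):
    """Return {code: text} for each language code found as a marker in `text`."""
    # Longest codes first so a longer code wins over a shorter one with the same prefix.
    ordered = sorted(codes, key=len, reverse=True)
    alternation = u'|'.join(re.escape(code) for code in ordered)
    # re.UNICODE makes \b treat non-ASCII letters as word characters on Python 2.
    splitter = re.compile(u'\\b(' + alternation + u')\\b', re.UNICODE)
    parts = splitter.split(text)[1:]  # drop whatever precedes the first marker
    # parts alternates code, text, code, text, ...
    return dict((parts[i], parts[i + 1].strip()) for i in range(0, len(parts), 2))

# Hypothetical sample: "en", "ru" and "es" markers, with the Russian words escaped.
sample = u"en hello world ru \u043f\u0440\u0438\u0432\u0435\u0442 \u043c\u0438\u0440 es hola mundo"
sections = split_by_language(sample, [u'en', u'ru', u'es'])
```

One caveat with this design: a language code that also occurs as a standalone word inside the text (for example the Spanish word "es") would be treated as a marker, so it only works when the codes cannot appear in the content itself.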

Or you could find them one at a time:

>>> re.findall(r'\b(en|es|fr) (.*?)(?:(?= (?:en|es|fr)\b)|$)', text)
[(u'en', u'<chars in english>'), (u'fr', u'<chars in french>'), (u'es', u'<chars in spanish>')]
>>>

Have you taken into consideration that the language-code boundaries `en`, `fr` and `es` are in different charsets? For example, if I try to find it with `r'en (.*) fr'`, it finds nothing, because `fr` is in a different charset. — swserg, Nov 27 '16 at 18:39
Um, what? If you are using unicode they aren't in different charsets. If you are using multiple charsets somehow (perhaps multiple windows code pages?) they can't be in the same string anyway. And you'd have to decode them to unicode to get it to work. Testing `re.search(r'en (.*) fr', u"en fr es ")` works fine for me. — tdelaney, Nov 27 '16 at 19:07
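
To illustrate the point about decoding, here is a minimal sketch that assumes the input arrives as UTF-8 encoded bytes. The byte string and the `ru` marker are hypothetical, and the actual encoding of the asker's data is not known. Once the bytes are decoded to a unicode string, an ordinary pattern matches across the sections.

```python
import re

# Hypothetical input: UTF-8 bytes, as they might arrive from a file or socket.
# The escaped bytes spell a six-letter Russian word.
raw = b"en hello ru \xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82 es hola"

# Decode once, at the boundary. 'utf-8' is an assumption about the source encoding.
text = raw.decode('utf-8')

# With a unicode string, the markers and the text share one representation.
match = re.search(u'\\bru (.*?) es\\b', text, re.UNICODE)
russian = match.group(1)
```

If the bytes were searched without decoding, each Cyrillic letter would be two separate bytes, which is one way a pattern can appear to "find nothing".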
