
I have a string that contains characters from different languages, like:

en <chars in english> fr <chars in french> es <chars in spanish>

I need to extract just the substring for a specific language from the string above. How can I do that using regex or some other tool in Python 2.6?

P.S. The parts could be in a different order, like: en (.*) es (.*) it (.*). The problem is that es, fr or it is not in the Latin charset, which is why a regular regex is not working correctly with it.

swserg

1 Answer


Regex works with unicode, and you have several options for dicing up your strings. Here is an example where the string is split on language-code boundaries such as "en" and "es" and the pieces are put in a list. Then it's a matter of iterating over the list and finding the language you want.

>>> import re
>>> text = u"en <chars in english> fr <chars in french> es <chars in spanish>"
>>> languages = set((u'en', u'fr', u'es'))
>>> re_languages = '|'.join(languages)
>>> # Python 2.6 needs an explicit field index: '{0}', not '{}'
>>> splitter = re.compile(ur'\b({0})\b'.format(re_languages))
>>> splitter.split(text)
[u'', u'en', u' <chars in english> ', u'fr', u' <chars in french> ', u'es', u' <chars in spanish>']

>>> parts=splitter.split(text)[1:]
>>> for i in range(0, len(parts),2):
...     if parts[i] == 'es':
...         print parts[i+1]
... 
 <chars in spanish>
>>> 
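
The split-and-iterate steps above can be folded into one helper that returns a dictionary keyed by language code. This is only a minimal sketch: the name `split_by_language` and the sample sentence are made up for illustration, and `re.UNICODE` is added on the assumption that you want `\b` to treat non-ASCII letters as word characters on Python 2.

```python
import re

def split_by_language(text, codes):
    """Map each language code to the text that follows it (hypothetical helper)."""
    # Sort for a deterministic pattern; escape in case a code contains regex syntax.
    alternatives = u'|'.join(re.escape(code) for code in sorted(codes))
    # '{0}' rather than '{}' so that str.format also works on Python 2.6.
    splitter = re.compile(u'\\b({0})\\b'.format(alternatives), re.UNICODE)
    # Drop the text before the first code, leaving [code, text, code, text, ...].
    parts = splitter.split(text)[1:]
    return dict((parts[i], parts[i + 1].strip()) for i in range(0, len(parts), 2))

# Made-up sample text, not taken from the question.
sample = u"en hello world fr bonjour le monde es hola mundo"
sections = split_by_language(sample, [u'en', u'fr', u'es'])
```

Here `sections[u'es']` gives the Spanish part. One caveat: a code that also occurs as an ordinary word inside the text (French "en", Spanish "es") would be treated as a boundary too, so this only suits input where the codes are unambiguous.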

Or you could find them one at a time

>>> re.findall(r'\b(en|es|fr) (.*?)(?:(?= (?:en|es|fr)\b)|$)', text)
[(u'en', u'<chars in english>'), (u'fr', u'<chars in french>'), (u'es', u'<chars in spanish>')]
>>> 
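
To check the same `findall` pattern against non-Latin text, here is a rough sketch that builds the pattern from a list of codes. The function name `find_sections`, the codes `ru` and `el`, and the sample text are all assumptions for the demo; the coding line at the top is what Python 2 needs to accept non-ASCII literals in a source file.

```python
# -*- coding: utf-8 -*-
import re

def find_sections(text, codes):
    """Return (code, text) pairs in the order they appear (hypothetical helper)."""
    alternatives = u'|'.join(re.escape(code) for code in codes)
    # Same shape as the pattern above: a code, a space, then the shortest run of
    # text that is followed by " <code>" or by the end of the string.
    pattern = re.compile(
        u'\\b({0}) (.*?)(?:(?= (?:{0})\\b)|$)'.format(alternatives),
        re.UNICODE)
    return pattern.findall(text)

# Made-up sample mixing Latin, Cyrillic and Greek text.
sample = u"en hello world ru привет мир el γειά σου"
pairs = find_sections(sample, [u'en', u'ru', u'el'])
```

Because the input is a unicode string, the Cyrillic and Greek parts are matched just like the English one.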
tdelaney
  • Have you taken into consideration that the language-code boundaries `en`, `fr` and `es` are in different charsets? For example, if I try to find it with `r'en (.*) fr'`, it finds nothing, because `fr` is in a different charset. – swserg Nov 27 '16 at 18:39
  • Um, what? If you are using unicode they aren't in different charsets. If you are using multiple charsets somehow (perhaps multiple windows code pages?) they can't be in the same string anyway. And you'd have to decode them to unicode to get it to work. Testing `re.search(r'en (.*) fr', u"en fr es ")` works fine for me. – tdelaney Nov 27 '16 at 19:07
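
The decode-first point from the last comment might look like the sketch below. The byte string and the choice of UTF-8 are assumptions; if the data really comes in another encoding (for example a Windows code page), that encoding's name has to be passed to `decode` instead.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical input: raw bytes as they might arrive from a file or a socket.
raw = u"en hello fr привет es мир".encode('utf-8')

# Decode once at the boundary so the regex works on unicode, not on bytes.
# UTF-8 is assumed here; use the real encoding of your data.
text = raw.decode('utf-8')

match = re.search(u'en (.*) fr', text, re.UNICODE)
english = match.group(1) if match else None
```

After decoding, `english` holds the text between `en` and `fr`, whatever script the rest of the string uses.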