How to separate a mixed word (Persian and English) in Python

Question

Hi, I have a dataset of strings, and some of the strings contain mixed words such as the ones below:

    سلام12World
    دوربینdigital
    سال2012good

... and my desired output is:

    12 سلام world
    دوربین digital
    2012 سال good

Here is my code:

    import re

    def spliteKeyWord(s):
        # Matches zero-width characters, digit runs, or English words
        regex = r"[\u200b-\u200c]|[0-9]+|[a-zA-Z]+\'*[a-z]*"
        matches = re.findall(regex, s, re.UNICODE)
        return matches

But this code doesn't produce my desired output. Is it possible to get output like that?

blhsing · Accepted Answer · 2019-02-07T20:27:05.207


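You can use `re.findall` with an alternation pattern. The first alternative matches runs of digits and English letters; the second matches runs of any other word characters, such as Persian letters:

import re

def spliteKeyWord(s):
    return re.findall(r'[\dA-Za-z]+|[^\dA-Za-z\W]+', s, re.UNICODE)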
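As a quick check, the sketch below runs this pattern on the sample strings from the question plus the `Iphone6` case. The name `split_keyword` and the `samples` list are illustrative and not part of the answer; the `re.UNICODE` flag is left out because it is already the default for `str` patterns in Python 3.

```python
import re

# Same alternation as in the answer: a run of digits/ASCII letters,
# or a run of any other word characters (e.g. Persian letters).
PATTERN = re.compile(r'[\dA-Za-z]+|[^\dA-Za-z\W]+')

def split_keyword(s):
    """Return the runs matched by PATTERN, in order of appearance."""
    return PATTERN.findall(s)

samples = ['سلام12World', 'دوربینdigital', 'سال2012good', 'Iphone6']
for s in samples:
    print(split_keyword(s))
```

With this pattern, digits stay attached to adjacent English letters, so `Iphone6` comes back as a single token rather than being split into `Iphone` and `6`.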

edited Feb 07 '19 at 20:27

answered Feb 07 '19 at 20:12

blhsing


Thanks @blhsing. Does this function work on a text column of a DataFrame? – get data Feb 07 '19 at 20:21
I believe the `.str` methods in `pandas` support regex, so it should work. – r.ook Feb 07 '19 at 20:22
I tested this function and it is correct, but when there is a word like 'Iphone6', it converts the word to 'Iphone', '6'. – get data Feb 07 '19 at 20:26

@getdata I've updated my answer so that numbers and English letters are grouped together instead. – blhsing Feb 07 '19 at 20:27
Dear @blhsing, your code doesn't handle punctuation. – get data Feb 09 '19 at 12:39
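
The last comment does not say how punctuation should be handled. One possible reading, sketched here purely as an assumption, is that runs of punctuation should come back as separate tokens. Adding a third alternative, `[^\w\s]+`, to the accepted answer's pattern does that. The function name `split_keyword_punct` is hypothetical.

```python
import re

# Third alternative: runs of characters that are neither word characters
# nor whitespace, i.e. punctuation and symbols.
PATTERN = re.compile(r'[\dA-Za-z]+|[^\dA-Za-z\W]+|[^\w\s]+')

def split_keyword_punct(s):
    """Like the accepted answer, but also returns punctuation runs."""
    return PATTERN.findall(s)

print(split_keyword_punct('دوربینdigital!'))
print(split_keyword_punct('Iphone6, ok!'))
```

Whitespace is still skipped. Note that the zero-width non-joiner (U+200C), which is common in Persian text, is neither a word character nor whitespace, so it would likely also be picked up by the third alternative and may need separate handling.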

score 0 · Answer 2 · answered Feb 07 '19 at 20:19

Referencing this question, you can use this regex to split on runs of non-ASCII characters:

import re

words = ['12سلامWorld','دوربینdigital','2012سالgood']

for w in words:
    # The capture group keeps the non-ASCII runs in the result.
    print(re.split(r'([^\x00-\x7F]+)', w))

# ['12', 'سلام', 'World']
# ['', 'دوربین', 'digital']
# ['2012', 'سال', 'good']

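This splits the string at each run of non-ASCII characters. Because the pattern is wrapped in a capture group, the non-ASCII runs themselves are kept in the result. An empty string appears when the input starts with such a run.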
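
Since `re.split` emits an empty string when the input begins or ends with a non-ASCII run (as in the second output line), a small follow-up is to filter those out. A minimal sketch, with `split_non_ascii` as an illustrative name that is not part of the answer:

```python
import re

def split_non_ascii(w):
    # The capture group keeps the non-ASCII runs in the result;
    # the comprehension drops empty strings at the edges.
    return [part for part in re.split(r'([^\x00-\x7F]+)', w) if part]

print(split_non_ascii('دوربینdigital'))
```

The parts can then be joined with spaces, for example `' '.join(split_non_ascii(w))`, if a single string is wanted.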
