To resolve this problem in Persian, we have a character called the zero-width non-joiner (نیم‌فاصله in Persian, also known as half space or semi-space). Two code points are used for it in practice. One is standard and the other is not standard but widely used:
- \u200C : http://en.wikipedia.org/wiki/Zero-width_non-joiner
- \u200F : Right-to-left mark (http://unicode-table.com/en/#200F)
As far as I know, Dari is very similar to Persian. So first of all, you should correct all words like زنده گی to زنده‌گی and convert all wrong spaces to half spaces. Then you can simply use this regex to match all the words of a sentence:
[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+
Online demo (the black bullet in the test string is a half space, which regex101 cannot display properly, but if you check the match information panel and look at Match 5 you will see that it is matched correctly).
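If you want to try the same pattern from code, here is a minimal Python sketch. The function name `find_words` and the sample string are only my illustration; the pattern itself is the one given above, unchanged.

```python
import re

# The pattern from this answer: the Arabic block plus the two
# code points listed above (U+200C and U+200F).
WORD_RE = re.compile(r"[\u0600-\u06FF\uFB8A\u067E\u0686\u06AF\u200C\u200F]+")


def find_words(text):
    """Return every run of characters that the pattern treats as one word."""
    return WORD_RE.findall(text)


# Sample (hypothetical): the same verb written once with a half space
# (U+200C) and once with an ordinary space, joined by the word "و".
sample = (
    "\u0645\u06CC\u200C\u0634\u0648\u062F"   # "می" + half space + "شود"
    " \u0648 "                               # " و "
    "\u0645\u06CC \u0634\u0648\u062F"        # "می" + ordinary space + "شود"
)

words = find_words(sample)
for w in words:
    print(repr(w))
```

The form written with a half space comes back as a single match, while the form written with an ordinary space is split into two matches. That is why the wrong spaces have to be converted first. Note that the range \u0600-\u06FF also covers Arabic-script punctuation such as ، and ؟, so you may want to strip those separately.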
To convert the wrong spaces in a large text to half spaces, there is a free and open-source add-on for Microsoft Word called Virastyar. You can install it and clean up your whole text. Keep in mind, though, that this add-on was created for Persian and not Dari. For example, in Persian we write زنده‌گی as زندگی, so it cannot correct this word for you. But other words like می شود are easily corrected and converted to می‌شود. You can also add custom words to its database.
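If you cannot use the add-on, a very rough version of one such correction can be sketched in Python. This is not how Virastyar works. It is only a hypothetical rule I am assuming for illustration: the verbal prefix "می" (or "نمی") standing alone and followed by an ordinary space gets a half space instead. The names `PREFIX_RE` and `fix_prefix_spaces` are made up for this example.

```python
import re

ZWNJ = "\u200c"  # half space

# Hypothetical rule (an assumption, not Virastyar's algorithm):
# "می" or "نمی" as a separate token, followed by one ordinary space
# and then another Arabic-script letter.
#   \u0646 = ن, \u0645 = م, \u06CC = ی
PREFIX_RE = re.compile(
    r"(?<![\u0600-\u06FF\u200C])"   # not preceded by a letter or half space
    r"(\u0646?\u0645\u06CC)"        # the prefix itself
    r" "                            # the wrong (ordinary) space
    r"(?=[\u0600-\u06FF])"          # followed by an Arabic-script character
)


def fix_prefix_spaces(text):
    """Replace the space after a standalone 'می' / 'نمی' with a half space."""
    return PREFIX_RE.sub(lambda m: m.group(1) + ZWNJ, text)


wrong = "\u0645\u06CC \u0634\u0648\u062F"   # "می شود" with an ordinary space
fixed = fix_prefix_spaces(wrong)
print(repr(fixed))
```

A rule like this will also join the rare standalone word "می" to whatever follows it, and it does nothing for suffix cases like زنده گی, so treat it as a starting point and check the result by hand.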