How to completely separate words from non-words with Python, for Greek and Hebrew?

Question

I want to separate all the words from the non-words in Greek and Hebrew text.

I'm using this code:

words = re.findall(r'\w+|\S+', text)

The result is not satisfactory. For example:

It separates ⸂ἡμῶν καὶ κυρίου⸃ into (⸂ἡμῶν) (καὶ) (κυρίου) (⸃), but I want the opening bracket separated too: (⸂) (ἡμῶν).
It doesn't separate ⸂ὑπὲρ⸃ into (⸂) (ὑπὲρ) (⸃).
It also doesn't separate [ὑμῖν] into ([) (ὑμῖν) (]).
For Hebrew, it separates what is not supposed to be separated.

Hi OpenBiblica, have you looked at this? https://stackoverflow.com/questions/25067355/regex-to-match-hebrew-and-english-characters-except-numbers — Francis, Mar 21 '19 at 17:57
likely you need `re.UNICODE`, https://stackoverflow.com/a/393915/9214517 — adrtam, Mar 21 '19 at 18:15

score 0 · Answer 1 · answered Mar 22 '19 at 17:27

Thanks for the information. I've found a solution for Greek with this:

words = re.findall(r'\w+|[\[\]⸂⸃()]|\S+', text)
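Here is a runnable sketch of this approach on the three Greek samples from the question. It assumes the character class is meant to list each bracket individually, with `[` and `]` escaped. The helper name `split_greek` and the NFC normalization step are my additions. The normalization guards against decomposed input, where accents are separate combining characters that `\w` does not match.

```python
import re
import unicodedata

# Words first, then single bracket characters, then any other run of non-space characters.
TOKEN_RE = re.compile(r'\w+|[\[\]⸂⸃()]|\S+')

def split_greek(text):
    # Assumption: the input may be decomposed (NFD). NFC composes each letter
    # with its accents so that \w+ treats an accented letter as one word character.
    text = unicodedata.normalize('NFC', text)
    return TOKEN_RE.findall(text)

for sample in ('⸂ἡμῶν καὶ κυρίου⸃', '⸂ὑπὲρ⸃', '[ὑμῖν]'):
    print(split_greek(sample))
```

Any bracket or sign that is not listed in the character class still falls through to `\S+` and may stay glued to neighbouring punctuation, so the class has to be extended for other apparatus marks.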

But I still have a problem with Hebrew. How can I separate עַל־ אֵ֣לֶּה׀ אֲנִ֣י into (עַל־) (אֵ֣לֶּה) (׀) (אֲנִ֣י)?
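One possible way to handle the Hebrew case, offered as a sketch rather than a definitive answer: the Hebrew vowel points and cantillation accents are combining marks, which `\w` in Python's `re` does not match, so `\w+` breaks a pointed word at every mark. The pattern below spells out the Hebrew letters and marks by code point instead. It assumes, based on the desired output above, that a trailing maqaf (U+05BE) stays attached to its word and that paseq (U+05C0) stands alone. The name `split_hebrew` is hypothetical.

```python
import re

# Hebrew letters (U+05D0-U+05EA) plus the points and accents that attach to them.
# Maqaf (U+05BE), paseq (U+05C0) and sof pasuq (U+05C3) are deliberately left out of the class.
HEBREW_WORD = r'[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u05D0-\u05EA]+'

# A Hebrew word may keep one trailing maqaf. Anything else falls back to \w+
# (other scripts) or to a single non-space character (punctuation such as paseq).
TOKEN_RE = re.compile(HEBREW_WORD + r'\u05BE?|\w+|\S')

def split_hebrew(text):
    return TOKEN_RE.findall(text)

print(split_hebrew('עַל־ אֵ֣לֶּה׀ אֲנִ֣י'))
```

Because the fallback is a single `\S` instead of `\S+`, every punctuation character that is not part of a word comes out as its own token.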
