I want to separate all the words from the non-words in Greek and Hebrew text.

I'm using this code:

words = re.findall(r'\w+|\S+', text)

The result is not satisfactory. For example:

  • it splits ⸂ἡμῶν καὶ κυρίου⸃ into (⸂ἡμῶν) (καὶ) (κυρίου) (⸃), but I want the opening bracket separated too: (⸂) (ἡμῶν)

  • it doesn't split ⸂ὑπὲρ⸃ into (⸂) (ὑπὲρ) (⸃)

  • it also doesn't split [ὑμῖν] into ([) (ὑμῖν) (])

For Hebrew, it splits what is not supposed to be split.
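A minimal reproduction of the Greek cases, assuming Python 3 and input normalized to NFC (so each accented letter is a single precomposed character). It shows the likely cause: `\w+` cannot start on a bracket, so the `\S+` alternative takes over and consumes the bracket together with the letters that follow it.

```python
import re
import unicodedata

# Sample tokens taken from the examples above; NFC keeps accented letters precomposed.
text = unicodedata.normalize("NFC", "⸂ἡμῶν καὶ κυρίου⸃ ⸂ὑπὲρ⸃ [ὑμῖν]")

# \w+ fails on an opening bracket, so \S+ matches the whole run up to the next space.
# A closing bracket after a word is split off only because \w+ stops right before it.
tokens = re.findall(r"\w+|\S+", text)
print(tokens)
```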

  • Hi OpenBiblica, have you looked at this? https://stackoverflow.com/questions/25067355/regex-to-match-hebrew-and-english-characters-except-numbers – Francis Mar 21 '19 at 17:57
  • likely you need `re.UNICODE`, https://stackoverflow.com/a/393915/9214517 – adrtam Mar 21 '19 at 18:15

1 Answer

Thanks for the information. I found a solution for Greek with this:

words = re.findall(r'\w+|[\[\]⸂⸃()]|\S+', text)

But I still have a problem with Hebrew. How can I split this: עַל־ אֵ֣לֶּה׀ אֲנִ֣י into this: (עַל־) (אֵ֣לֶּה) (׀) (אֲנִ֣י)?
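One possible approach for the Hebrew case, offered as a sketch rather than a definitive answer. Python's `\w` matches letters and digits but not combining marks, so the vowel points and cantillation accents in pointed Hebrew break a word apart. The sketch below avoids `re` and classifies each character by its Unicode general category with the standard `unicodedata` module. The names `tokenize` and `is_word_char` are hypothetical, and keeping the maqaf (U+05BE) attached to the preceding word is an assumption taken from the desired output above.

```python
import unicodedata

MAQAF = "\u05be"  # Hebrew maqaf; assumed to stay attached to the word before it


def is_word_char(ch):
    # Letters (L*), combining marks (M*) and numbers (N*) count as part of a word.
    return unicodedata.category(ch)[0] in "LMN"


def tokenize(text):
    tokens = []
    current = ""
    for ch in text:
        if is_word_char(ch):
            current += ch
        elif ch == MAQAF and current:
            # Close the word with its trailing maqaf, as in (עַל־).
            tokens.append(current + ch)
            current = ""
        else:
            if current:
                tokens.append(current)
                current = ""
            if not ch.isspace():
                # Every other non-space character becomes its own token.
                tokens.append(ch)
    if current:
        tokens.append(current)
    return tokens


print(tokenize("עַל־ אֵ֣לֶּה׀ אֲנִ֣י"))
print(tokenize("⸂ὑπὲρ⸃ [ὑμῖν]"))
```

Unlike the `\S+` fallback in the regex, this emits each punctuation character as a separate token, so a run such as `⸃]` yields two tokens. Because combining marks count as word characters, it also works on decomposed (NFD) Greek.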