
I have two words:

word_1 = 'وتعالی'
word_2 = ':'

The words can be anything, mostly Unicode letters and punctuation.

I want to find them inside a long multiline string like this:

string = <Hadith><Document> قال:</Document> و سأله<Person> الحسین بن أسباط</Person> - و أنا أسمع - عن الذبیح <Prophet> إسماعیل</Prophet> أو<Prophet> إسحاق</Prophet> ؟ فقال:«<Prophet> إسماعیل</Prophet> ، أما سمعت قول الله تبارک و تعالی:«<Ayah WordIndex="۱-۲" Soreh="۳۷" Name="الصافات" Ayah="۱۱۲" AyahList="۱۱۲"> و بشرناه </Ayah> » «<Ayah WordIndex="۳-۳" Soreh="۳۷" Name="الصافات" Ayah="۱۱۲" AyahList="۱۱۲"><Prophet> بإسحاق</Prophet> </Ayah> » .</Hadith>
فقال : « إسماعیل ، أما سمعت قول الله
تبارک وتعالی : 

This means that <xml>وتعالی</xml> : should match my regex, but وتعالی الی صتیدشنتد : should not. If the Arabic characters are hard to follow, here is an example with English letters. Words:

word_1 = "hello"
word_2 = "there"

This should match: hello<xml>: there</xml>: but this one should not: hello<xml>guys there</xml>

As you can see, I want to find the index of the end of the two words using this pattern:

(any amount of whitespace or nothing, any punctuation or nothing, any XML tag or nothing, {word_1}, any XML tag or nothing, any punctuation or nothing, {word_2}, any amount of whitespace or nothing, any punctuation or nothing, any XML tag or nothing)
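One way to read the pattern described above is as "word_1, then any mix of whitespace, punctuation and XML tags, then word_2". The following is only a minimal sketch under that reading. The names `TAG`, `FILLER`, `build_pattern` and `end_of_pair` are made up for illustration and are not from the original post, and a "tag" is assumed to be anything between `<` and `>` that contains no other angle bracket.

```python
import re

# Hypothetical building blocks (assumptions, not part of the original post).
TAG = r'<[^<>]*>'  # one XML tag, opening or closing
# Zero or more of: whitespace, a punctuation character, or a tag.
# "Punctuation" is taken here to mean any character that is not a word
# character (\w is Unicode-aware, so Arabic letters count as word characters),
# not whitespace and not an angle bracket.
FILLER = rf'(?:\s|[^\w\s<>]|{TAG})*'

def build_pattern(word_1, word_2):
    """Compile a regex matching word_1, optional filler, then word_2."""
    return re.compile(re.escape(word_1) + FILLER + re.escape(word_2))

def end_of_pair(text, word_1, word_2):
    """Return the index just past word_2 in the first match, or None."""
    match = build_pattern(word_1, word_2).search(text)
    return match.end() if match else None

print(end_of_pair("hello<xml>: there</xml>:", "hello", "there"))    # 17
print(end_of_pair("hello<xml>guys there</xml>", "hello", "there"))  # None
```

The optional whitespace, punctuation and tags before word_1 and after word_2 are left out of the sketch on purpose: because they may all be empty, they do not change whether the pair matches, and leaving them out keeps `match.end()` pointing right after word_2.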

I tried many patterns suggested by ChatGPT, but none of them worked for me.

These are the patterns:

fr'\s*[^\w]*<.*?>{re.escape(last_words[0])}<.*?>{re.escape(last_words[1])}[^\w]*<.*?>\s*'

fr'<.*?>{re.escape(word2)}<.*?>{re.escape(word1)}<.*?>'

Neither of them worked.

I want to find the index of the last two words of the string and then insert something at that position. For example, I want the index of ":" in the string, because my two words are و تعالی and ":", and I want to add an XML tag right after the ":".

I don't want to remove any XML tag or punctuation from the text (if they can be removed and then put back, that's OK).

Sorry in advance for my bad English.

Any help will be appreciated!
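To illustrate the insertion step, here is a hedged sketch that finds the last occurrence of the word pair and splices a string in right after word_2 by slicing. The function name `insert_after_pair` and the `<mark/>` tag are hypothetical, and the filler between the two words is assumed to be whitespace, punctuation or tags as described earlier.

```python
import re

# Hypothetical helpers (assumptions, not part of the original post).
TAG = r'<[^<>]*>'                      # one XML tag
FILLER = rf'(?:\s|[^\w\s<>]|{TAG})*'   # whitespace, punctuation or tags

def insert_after_pair(text, word_1, word_2, insertion):
    """Insert `insertion` right after the last word_1 ... word_2 pair.

    The text is returned unchanged when the pair is not found.
    """
    pattern = re.compile(re.escape(word_1) + FILLER + re.escape(word_2))
    matches = list(pattern.finditer(text))
    if not matches:
        return text
    end = matches[-1].end()  # index just past word_2
    # Strings are immutable, so build a new one around the insertion point.
    return text[:end] + insertion + text[end:]

print(insert_after_pair("say hello<b>: there</b>!", "hello", "there", "<mark/>"))
# say hello<b>: there<mark/></b>!
```

Nothing is removed from the text here: the existing tags and punctuation stay where they are and only the new string is added.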
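If removing the tags temporarily is acceptable, one possible approach is to strip them while remembering where every remaining character came from, search the plain text, and then translate the match position back to the original string. This is only a sketch: `strip_tags_with_map` and `end_index_in_original` are made-up names, and tags are assumed to be simple `<...>` spans without nested angle brackets.

```python
import re

def strip_tags_with_map(text):
    """Return (plain, index_map): the text without tags, plus a list where
    index_map[i] is the position in `text` of the i-th character of `plain`."""
    plain_chars, index_map = [], []
    pos = 0
    for tag in re.finditer(r'<[^<>]*>', text):
        # Keep everything between the previous tag and this one.
        for i in range(pos, tag.start()):
            plain_chars.append(text[i])
            index_map.append(i)
        pos = tag.end()
    # Keep whatever follows the last tag.
    for i in range(pos, len(text)):
        plain_chars.append(text[i])
        index_map.append(i)
    return ''.join(plain_chars), index_map

def end_index_in_original(text, word_1, word_2):
    """Index in the ORIGINAL text just past word_2, or None if no match."""
    plain, index_map = strip_tags_with_map(text)
    # With the tags gone, only non-word characters may separate the words.
    match = re.search(re.escape(word_1) + r'\W*' + re.escape(word_2), plain)
    if match is None:
        return None
    return index_map[match.end() - 1] + 1

print(end_index_in_original("hello<xml>: there</xml>:", "hello", "there"))  # 17
```

Because the original string is never modified, there is nothing to put back afterwards; the returned index can be used directly for slicing.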

Wiktor Stribiżew
  • Before you go much further with parsing XML via regex, make sure you understand the implications of this notorious post [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). It might help to use something like `beautifulsoup` to search and manage these tags. – JonSG Aug 28 '23 at 14:32
  • Thanks, I'll give it a try. Do you have any idea how to do it? – Mostafa Kooti Aug 28 '23 at 14:45
  • It is not clear what you are asking for. Could you make your question clearer? What is your input XML, and what should your expected output look like? Is the position also important? If yes, your example should show such a match. Must the colon be the first character in the string and also come after the closing tag? – Hermann12 Aug 28 '23 at 16:44
  • @Hermann12 That is why I gave an English example. I want to find two words that come one after the other (meaning there can be an XML tag or punctuation between them), just this. – Mostafa Kooti Aug 28 '23 at 18:58
  • It will get messy quickly. You have RTL text embedded in an LTR environment, which will complicate things. Negotiating the embedding levels will be awkward. As has been indicated, it may be better to parse it as structured text using beautifulsoup rather than trying to handle it as a string. – Andj Aug 28 '23 at 20:54
  • @MostafaKooti, this unfortunately does not clarify the unclear request. You search for two words, and what about the colon in your example, hello: there: ? So are you searching for three or four things? – Hermann12 Aug 28 '23 at 22:05

1 Answer


It seems that the word you have given does not exist in the string. If you select a word that is in the text, `re.findall` does the job.

# pip install arabic-reshaper
import re
import arabic_reshaper
st = """<Hadith><Document> قال:</Document> و سأله<Person> الحسین بن أسباط</Person> - و أنا أسمع - عن الذبیح <Prophet> إسماعیل</Prophet> أو<Prophet> إسحاق</Prophet> ؟ فقال:«<Prophet> إسماعیل</Prophet> ، أما سمعت قول الله تبارک و تعالی:«<Ayah WordIndex="۱-۲" Soreh="۳۷" Name="الصافات" Ayah="۱۱۲" AyahList="۱۱۲"> و بشرناه </Ayah> » «<Ayah WordIndex="۳-۳" Soreh="۳۷" Name="الصافات" Ayah="۱۱۲" AyahList="۱۱۲"><Prophet> بإسحاق</Prophet> </Ayah> » .</Hadith>"""


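# word_1 is the search word defined in the question; reshape it the same way as the text
word_1 = arabic_reshaper.reshape(word_1)

# Escape the word so any regex metacharacters in it are matched literally
regex = rf'({re.escape(word_1)}.*?:)'

print(re.findall(regex, arabic_reshaper.reshape(st)))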


LetzerWille