I want to find two spcefic words after each other in python(even if ther is an xml tag or punctuation between them)

Question

i have two words:

word_1 = 'وتعالی'
word_2 = ':'

the words can be anything mostly unicode and punctuation

and i want to find them inside a long multiline string like this:

string = <Hadith><Document> قال:</Document> و سأله<Person> الحسین بن أسباط</Person> - و أنا أسمع - عن الذبیح <Prophet> إسماعیل</Prophet> أو<Prophet> إسحاق</Prophet> ؟ فقال:«<Prophet> إسماعیل</Prophet> ، أما سمعت قول الله تبارک و تعالی:«<Ayah WordIndex="۱-۲" Soreh="۳۷" Name="الصافات" Ayah="۱۱۲" AyahList="۱۱۲"> و بشرناه </Ayah> » «<Ayah WordIndex="۳-۳" Soreh="۳۷" Name="الصافات" Ayah="۱۱۲" AyahList="۱۱۲"><Prophet> بإسحاق</Prophet> </Ayah> » .</Hadith>
فقال : « إسماعیل ، أما سمعت قول الله
تبارک وتعالی :

it means the <xml>وتعالی</xml> : should match my regex but وتعالی الی صتیدشنتد : shouldnt match if the arabic characters are annying let me take an exmaple by english letters: words:

word_1 = "hello"
word_2 = "there"

this should match hello<xml>: there</xml>: but this one hello<xml>guys there</xml> shouldnt

as you see i want to find the index of the end of two words by this pattern:

(as many spaces as it can be or nothing, any punctuations or nothing , any xml tag or nothing {word_1} any xml tag or nothing, any punctuations or nothing {word_2} as many spaces as it can be or nothing, any punctuations or nothing , any xml tag or nothing )

i tried so many patterns provided by chat GPT but none of theme worked for me:

these are the patterns :

fr'\s*[^\w]*<.*?>{re.escape(last_words[0])}<.*?>{re.escape(last_words[1])}[^\w]*<.*?>\s*'

fr'<.*?>{re.escape(word2)}<.*?>{re.escape(word1)}<.*?>'

and none of them worked

i want to find the index of the last two words of the string and then add somthing that i wnat at there for example i want the index of ":" in the string becuase my two words are و تعالی and the ":" and i want to add a xml tag right after ":"

i dont want to remove any xml tag or punctuation from the text (i mean if they can be removed then took them back it's ok)

sorry for bad english in advance

any help will be appriciated!

Before you go much farther with parsing xml via regex, make sure you understand the implications of this notorious post [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). It might help to use something like `beautifulsoup` to search and manage these tags. — JonSG, Aug 28 '23 at 14:32
Not clear what you are asking for. Could you make your question more clear. What is your input xml and how should look like your expected output? The position is also important? If yes, you example should show such a match. The colon must be the first sign in the string and also after the closing tag? — Hermann12, Aug 28 '23 at 16:44
@Hermann12 That is why I took an English example I want to find two words that came after each other(after each other means there can be a xml tag or punctuation between them) just this — Mostafa Kooti, Aug 28 '23 at 18:58
It will get messy quickly. You have RTL text embedded in an LTR environment which will complicate things. Negotiating the embedding levels will be awkward. As has been indicated it may be better to parse it as structured text using beautifulsoup rather than trying to handle it as a string. — Andj, Aug 28 '23 at 20:54
@MostafaKooti, this unfortunately not clarify the unclear request. You search two words and what about the colon in your example, hello: there: ? So you search three or four things? — Hermann12, Aug 28 '23 at 22:05

LetzerWille · Answer 1 · 2023-08-29T03:59:58.447

It seems that the word you have given does not exist in the string. If you select a word that is in the text, re.findall does the job.

install arabic_reshaper
import arabic_reshaper
st = """<Hadith><Document> قال:</Document> و سأله<Person> الحسین بن أسباط</Person> - و أنا أسمع - عن الذبیح <Prophet> إسماعیل</Prophet> أو<Prophet> إسحاق</Prophet> ؟ فقال:«<Prophet> إسماعیل</Prophet> ، أما سمعت قول الله تبارک و تعالی:«<Ayah WordIndex="۱-۲" Soreh="۳۷" Name="الصافات" Ayah="۱۱۲" AyahList="۱۱۲"> و بشرناه </Ayah> » «<Ayah WordIndex="۳-۳" Soreh="۳۷" Name="الصافات" Ayah="۱۱۲" AyahList="۱۱۲"><Prophet> بإسحاق</Prophet> </Ayah> » .</Hadith>'"""

word_1 = arabic_reshaper.reshape(word_1)

regex = rf'({word_1}.*?:)'

print(re.findall(regex, arabic_reshaper.reshape(st)))

I want to find two spcefic words after each other in python(even if ther is an xml tag or punctuation between them)

1 Answers1