0

I'm trying to write a regex for use in the Python re.findall method to find all comments in a Word comments.xml file that have a particular paraID.

My regex is <w:comment.*?w14:paraId="727F9BCE".*?</w:comment>

I want to match on the 2nd of the below comments, however my regex matches both.

<w:comment w:id="0" w:author="LW 2" w:date="2023-05-03T22:54:00Z" w:initials="LFW2">
    <w:p w14:paraId="698DC7BC" w14:textId="22BC570B" w:rsidR="006040CE" w:rsidRDefault="006040CE">
    <w:pPr>
    <w:pStyle w:val="CommentText"/>
    </w:pPr>
    <w:r>
    <w:rPr>
    <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:annotationRef/>
    </w:r>
    <w:r>
    <w:t>Open comment</w:t>
    </w:r>
    </w:p>
</w:comment>
<w:comment w:id="1" w:author="LW 2" w:date="2023-05-03T22:54:00Z" w:initials="LFW2">
    <w:p w14:paraId="727F9BCE" w14:textId="1EDEEF44" w:rsidR="006040CE" w:rsidRDefault="006040CE">
    <w:pPr>
    <w:pStyle w:val="CommentText"/>
    </w:pPr>
    <w:r>
    <w:rPr>
    <w:rStyle w:val="CommentReference"/>
    </w:rPr>
    <w:annotationRef/>
    </w:r>
    <w:r>
    <w:t>Done comment</w:t>
    </w:r>
    </w:p>
</w:comment>

I understood that modifying the .* quantifier with the ? quantifier would render the regex lazy not greedy (and so match the shortest string that matches), but this isn't the behavior of my re.findall method under Python 3.8.

I am aware I could/should use an xml parser but I've been having many issues with namespaces and thought regex would be simpler.

bichanna
  • 954
  • 2
  • 21
lalau66
  • 19
  • 2
  • 2
    Why are you using regex for this at all? Use an XML parser. – tripleee May 07 '23 at 13:41
  • @triplee I explained that I was having many issues with namespaces when trying to use lxml or similar python packages. The MS Open Office xml standard is extremely complex and the namespace is quite large and confusing. – lalau66 May 19 '23 at 05:57
  • The problem with a _too_ simple solution is that it's wrong. The duplicate explains a number of corner cases where it's hard or impossible to solve parsing problems with regular expressions. Someting like `[^<>]*` instead of `.*?` is probably slightly better, but then that won't work for cases like where the attribute contains a quoted string which contains literal `<` or `>` characters. – tripleee May 19 '23 at 06:53
  • Try python-docx package. – Hermann12 May 19 '23 at 14:08

0 Answers0