-1

I need some help with regular expression in python. I have many files which, unfortunately, contains invaid xml: as below some 'img' tag is not closed.

<a href="a_url" class="url" target="_blank">
   <img ng-src=" {{getImageUrl('homeHelp64.jpg')}}" alt="?" width="16" height="16">
      Home
</a>

In Python what I would like to do is to find all such 'img' tags that are not closed and replace it with close tag like below ('/' before >):

<img ng-src=" {{getImageUrl('homeHelp64.jpg')}}" alt="?" width="16" height="16"/>

using the following pattern I can get all instances of img tag but i need to get only the ones that are not closed.

pattern = '(img.*?)>'

will appreciate your help in defining the pattern and how to replace the 'img' and close the xml tag at the end.

user1749707
  • 159
  • 14
  • 1
    You are probably better off using an html/xml parser as this type of match is horrific in Regex. For example, Regex doesn't understand quoted items (that may have contain `>`) or child elements – Jonathan Twite Jun 09 '20 at 14:37
  • 2
    You might try `]*(?=(?<!/)>)` and replace with `\g<0>/` but you should read [this page](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) why that can be really tricky. – The fourth bird Jun 09 '20 at 14:40
  • @Thefourthbird Thanks a lot !! – user1749707 Jun 09 '20 at 15:14
  • @Thefourthbird using that pattern i do not get '>' at the end. is there a possibility to get it as well? – user1749707 Jun 09 '20 at 15:17
  • @JonathanTwite, I tried xml parsing first with since the xml is invalid I was getting error. – user1749707 Jun 09 '20 at 15:18
  • If there is a `>` it will put the `/` in front of it right? https://regex101.com/r/KYGD8D/1 – The fourth bird Jun 09 '20 at 15:28
  • @user1749707, if this is HTML, the image tag doesn't technically need closing so this is syntactically correct (https://html.spec.whatwg.org/multipage/embedded-content.html#the-img-element) If this is XHTML, then you are probably better served to parse it with an HTML to XHTML converter – Jonathan Twite Jun 09 '20 at 20:59

1 Answers1

-1

If I understood the problem correctly, I managed to write the required regular expression. By link, you can test the expression. Required flag - s

https://regex101.com/r/YctLzb/5/tests

\<(\w+)\s[^<>]+\>(?!.*\<\/\1\>)

Update: this works without s flag and matches tags without attrs: https://regex101.com/r/6nwHQP/1/tests

\<(\w+)(?:\s[^<>]+)?\>(?!(?:.|\n)*\<\/\1\>)
  • The links to that website are not very helpful. It's better to include the examples of what those patterns do and don't match here in your answer. – Han-Kwang Nienhuys Jun 09 '20 at 17:19