What I'm trying to do in this example is remove HTML tags, including everything inside them. However, I can never know whether a tag will be written as <tag>...</tag> or as a self-closing <tag ... />, so I need the regex to work as an 'or' statement.

In plain English:

Replace '<tag' and everything between it and either the next '</tag>' or the next '/>'

Here is my attempt in Python:

import re

html = '''
            <title>Test1</title>
        <link rel=\'dns-prefetch\' href=\'//www.test.com\' />
        <link rel=\'dns-prefetch\' href=\'//fonts.googleapis.com\' />
        <title>Test2</title>
        <link rel=\'dns-prefetch\' href=\'//code.ionicframework.com\' />
        <link rel=\'dns-prefetch\' href=\'//s.w.org\' />
        <link rel=\'dns-prefetch\' href=\'//code.ionicframework.com\' />
        <link rel=\'dns-prefetch\' href=\'//s.w.org\' />
'''

html = re.sub(r'\\n|\\r|\\t', '', html)
html = re.sub(r'<!--(.*?)-->', '[coMmEnT]', html)

def removeTag(html, label):
    html = re.sub(r'<'+label+'(.*?)</'+label+'>|/>', '~'+label+'~', html)
    return html

html = removeTag(html, 'title')
html = removeTag(html, 'link')

print(html)

With the variables inserted, the two removeTag() calls would be:

re.sub(r'<link(.*?)</link>|/>', '~link~', html)

re.sub(r'<title(.*?)</title>|/>', '~title~', html)
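
One way to check how Python reads the title pattern is to list what it matches. This is only a minimal sketch: `sample` is a made-up one-line string, not the full html above. Because `|` has the lowest precedence in a regex, the pattern is read as "`<title(.*?)</title>` or `/>`", not as "`<title(.*?)` followed by either ending".

```python
import re

# Made-up one-line sample, assumed for illustration only.
sample = "<title>Test1</title><link rel='x' href='//a' />"

# The alternation splits the WHOLE pattern into two alternatives:
#   <title(.*?)</title>    OR    />
pattern = re.compile(r'<title(.*?)</title>|/>')

# Collect the full text of every match.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)  # ['<title>Test1</title>', '/>']
```

The second match is a bare `/>` with no `<title` in front of it, so any self-closing tag ending would be replaced by the title label.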

Ideally, my output would be:

~title~ ~link~ ~link~ ~title~ ~link~ ~link~ ~link~ ~link~

But instead it is:

~title~
<link rel='dns-prefetch' href='//www.test.com' ~title~
<link rel='dns-prefetch' href='//fonts.googleapis.com' ~title~
~title~
<link rel='dns-prefetch' href='//code.ionicframework.com' ~title~
<link rel='dns-prefetch' href='//s.w.org' ~title~
<link rel='dns-prefetch' href='//code.ionicframework.com' ~title~
<link rel='dns-prefetch' href='//s.w.org' ~title~
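
A possible way to get closer to the ideal output is to confine the 'or' to the closing part of the tag with a non-capturing group. The sketch below is an assumption, not a tested drop-in fix: `remove_tag` is a hypothetical variant of removeTag, and `sample` is a shortened one-line stand-in for the html above.

```python
import re

def remove_tag(html, label):
    """Hypothetical variant of removeTag with the alternation grouped."""
    name = re.escape(label)
    # <label, a word boundary, then as little as possible up to
    # EITHER </label> OR /> ; the (?:...) group keeps the | local.
    pattern = r'<' + name + r'\b.*?(?:</' + name + r'>|/>)'
    # DOTALL lets . cross line breaks inside a tag.
    return re.sub(pattern, '~' + label + '~', html, flags=re.DOTALL)

# Shortened one-line sample, assumed for illustration only.
sample = ("<title>Test1</title> "
          "<link rel='dns-prefetch' href='//www.test.com' /> "
          "<title>Test2</title>")

result = remove_tag(remove_tag(sample, 'title'), 'link')
print(result)  # ~title~ ~link~ ~title~
```

This stops at whichever ending comes first, so a `/>` appearing inside the text of a `<title>...</title>` pair would end the match early; an HTML parser is generally more robust for real pages. Separately, in a raw string `r'\\n'` matches a literal backslash followed by `n` rather than a newline, which may explain why the line breaks survive in the output above; a character class such as `r'[\n\r\t]'` would match the actual whitespace characters.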

I'm brand new to regex, so any guidance would be much appreciated.

Slat