What I'm trying to do in this example is remove HTML tags, including everything inside them; however, I can never know whether a tag will be written as <tag>...</tag> or as a self-closing <tag ... />. For this reason, I need the regex to work as an 'or' statement.
In plain English:
Replace '<tag' and everything between it and either the next '</tag>' or the next '/>'
Here is my attempt in Python:
import re
html = '''
<title>Test1</title>
<link rel=\'dns-prefetch\' href=\'//www.test.com\' />
<link rel=\'dns-prefetch\' href=\'//fonts.googleapis.com\' />
<title>Test2</title>
<link rel=\'dns-prefetch\' href=\'//code.ionicframework.com\' />
<link rel=\'dns-prefetch\' href=\'//s.w.org\' />
<link rel=\'dns-prefetch\' href=\'//code.ionicframework.com\' />
<link rel=\'dns-prefetch\' href=\'//s.w.org\' />
'''
html = re.sub(r'\\n|\\r|\\t', '', html)
html = re.sub(r'<!--(.*?)-->', '[coMmEnT]', html)
def removeTag(html, label):
    html = re.sub(r'<'+label+'(.*?)</'+label+'>|/>', '~'+label+'~', html)
    return html
html = removeTag(html, 'title')
html = removeTag(html, 'link')
print(html)
With the variables inserted, the two removeTag() calls would be:
re.sub(r'<link(.*?)</link>|/>', '~link~', html)
re.sub(r'<title(.*?)</title>|/>', '~title~', html)
Ideally, my output would be:
~title~ ~link~ ~link~ ~title~ ~link~ ~link~ ~link~ ~link~
But instead it is:
~title~
<link rel='dns-prefetch' href='//www.test.com' ~title~
<link rel='dns-prefetch' href='//fonts.googleapis.com' ~title~
~title~
<link rel='dns-prefetch' href='//code.ionicframework.com' ~title~
<link rel='dns-prefetch' href='//s.w.org' ~title~
<link rel='dns-prefetch' href='//code.ionicframework.com' ~title~
<link rel='dns-prefetch' href='//s.w.org' ~title~
I'm brand new to regex, so any guidance would be much appreciated.
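A likely reading of the output above: in Python's re syntax, '|' has the lowest precedence, so the pattern splits into "<title(.*?)</title>" OR "/>" rather than "<title ... then either ending". That would explain why every '/>' became ~title~ on the first call and nothing was left for the 'link' call. Below is a minimal sketch of the grouped form, not a definitive fix. The name remove_tag, the shortened sample string, the \b after the tag name, and the use of re.DOTALL are all assumptions added for illustration; the sketch would also stop early if '/>' ever appeared inside a <tag>...</tag> body.

```python
import re

def remove_tag(html, label):
    # Hypothetical variant of removeTag: the two endings are wrapped in a
    # non-capturing group so '|' only chooses between '</label>' and '/>'.
    name = re.escape(label)
    pattern = r'<' + name + r'\b.*?(?:</' + name + r'>|/>)'
    # re.DOTALL (an assumption) lets '.' cross newlines inside a tag.
    return re.sub(pattern, '~' + label + '~', html, flags=re.DOTALL)

# Shortened sample in the same shape as the HTML in the question.
sample = (
    "<title>Test1</title>\n"
    "<link rel='dns-prefetch' href='//www.test.com' />\n"
    "<title>Test2</title>\n"
    "<link rel='dns-prefetch' href='//s.w.org' />\n"
)

result = remove_tag(remove_tag(sample, 'title'), 'link')
# Collapse the remaining newlines into single spaces for a one-line result.
result = ' '.join(result.split())
print(result)

# The ungrouped pattern from the question matches a bare '/>' on its own.
ungrouped = re.sub(r'<title(.*?)</title>|/>', '~title~', "<br />")
print(ungrouped)
```

For anything beyond simple, predictable markup, an HTML parser is usually a safer choice than a regex, since tags can nest or contain '/>' inside attribute values.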