I need to split an arbitrary HTML string into a list of tokens, where each token is one top-level element. For example:
"<abc/><abc/>" #INPUT
["<abc/>", "<abc/>"] #OUTPUT
or
"<abc comfy><room /></abc> <br /> <abc/> " # INPUT
["<abc comfy><room /></abc>", "<br />", "<abc/>"] # OUTPUT
or
"""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description123" /><link rel="stylesheet" href="../xx/css/default.css" />""" # INPUT
[
'<meta charset="utf-8" />',
"<title> test123 </title>",
'<meta name="test" content="index,follow" />',
'<meta name="description" content="Description123" />',
'<link rel="stylesheet" href="../xx/css/default.css" />',
] # OUTPUT
What I tried:
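from typing import List

def split(html: str) -> List[str]:
    if html == "":
        return []
    delimiter = "/>"
    # First word without the leading "<", e.g. "abc" for "<abc comfy>..." (unused so far)
    split_name = html.split(" ", maxsplit=1)[0]
    name = split_name[1:]
    # Split at every "/>" and re-append the delimiter to each non-empty piece
    delimited_list = [character + delimiter for character in html.split(delimiter) if character]
    # Everything after the first space; partition() avoids an IndexError when there is no space
    rest = html.partition(" ")[2]
    # Index of the first closing tag, or -1 if there is none (unused so far)
    char_delim = html.find("</")
    ### Help
    print(delimited_list)
    return delimited_list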
My output:
['<abc/>', '<abc/>']
['<abc comfy><room />', '</abc> <br />', ' <abc/>', ' />']
['<meta charset="utf-8" />', '<title> test123 </title><meta name="test" content="index,follow" />', '<meta name="description" content="Description123" />', '<link rel="stylesheet" href="../xx/css/default.css" />']
So I tried splitting at "/>", which works for the first case but breaks as soon as an element has children or a separate closing tag. After that I tried several things, for example identifying the "name", i.e. the first identifier in the HTML string, such as "abc".
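One possible direction, as a rough sketch rather than a finished solution: instead of splitting at "/>", walk over every tag and keep track of the nesting depth, then cut a token each time the depth returns to zero. The names split_top_level, TAG_RE and VOID_TAGS are made up for this illustration. The sketch assumes reasonably well-formed input, no ">" inside attribute values, and no comments or script content that contain tags.

```python
import re
from typing import List

# Matches one tag: "<name ...>", "</name>" or "<name ... />".
# Assumption: attribute values never contain "<" or ">".
TAG_RE = re.compile(r"<(/?)([A-Za-z][\w:-]*)[^<>]*?(/?)>")

# HTML void elements have no closing tag, so they never open a new level.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def split_top_level(html: str) -> List[str]:
    """Return the top-level elements of an HTML fragment (hypothetical helper)."""
    tokens: List[str] = []
    depth = 0   # number of elements that are currently open
    start = 0   # index where the current top-level element begins
    for match in TAG_RE.finditer(html):
        closing, name, self_closing = match.groups()
        if closing:
            if depth == 0:
                continue  # stray closing tag without an opener: ignore it
            depth -= 1
        else:
            if depth == 0:
                start = match.start()  # a new top-level element starts here
            if not self_closing and name.lower() not in VOID_TAGS:
                depth += 1
        if depth == 0:
            # Back at the top level: everything since `start` is one token.
            tokens.append(html[start:match.end()])
    # Text between top-level elements and unclosed trailing elements are dropped.
    return tokens


print(split_top_level("<abc comfy><room /></abc> <br /> <abc/> "))
```

For arbitrary real-world HTML a proper parser (for example html.parser from the standard library) is likely more robust than a regular expression; the depth counting idea stays the same.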
Do you guys have any idea how to continue?
Thanks!
Greetings Nick