Substring any kind of HTML String

Question

I need to split any kind of HTML code (a string) into a list of tokens. For example:

"<abc/><abc/>" #INPUT
["<abc/>", "<abc/>"] #OUTPUT

or

"<abc comfy><room /></abc> <br /> <abc/> " # INPUT
 ["<abc comfy><room /></abc>", "<br />", "<abc/>"] # OUTPUT

or

"""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""" # INPUT
[
     '<meta charset="utf-8" />',
     "<title> test123 </title>",
     '<meta name="test" content="index,follow" />',
     '<meta name="description" content="Description" />',
     '<link rel="stylesheet" href="../layout/css/default.css" />',
 ] # OUTPUT

What I tried to do:

from typing import List

def split(html: str) -> List[str]:
     if html == "":
         return []

     delimiter = "/>"
     split_name = html.split(" ", maxsplit=1)[0]
     name = split_name[1:]

     delimited_list = [character + delimiter for character in html.split(delimiter) if character]

     rest = html.split(" ", maxsplit=1)[1]

     char_delim = html.find("</")

     ### Help
     print(delimited_list)
     return delimited_list

My output:

['<abc/>', '<abc/>']
['<abc comfy><room />', '</abc> <br />', ' <abc/>', ' />']

['<meta charset="utf-8" />', '<title> test123</title><meta name="test" content="index,follow" />', '<meta name="description" content="Description123" />', '<link rel="stylesheet" href="../xx/css/default.css" />']

So I tried to split at "/>", which works for the first case. Then I tried several other things, such as identifying the "name", i.e. the first identifier of the HTML string, like "abc".

Do you guys have any idea how to continue?

Thanks!

Greetings Nick

Perhaps you could try [lxml](https://lxml.de/tutorial.html) ? — han solo, Jul 11 '21 at 16:04
There are libraries. Find a reputable one which meets your needs and use it. There are too many oddly constructed HTML documents out there; you're not going to get all of them right. Better to use code which has proved itself. — rici, Jul 11 '21 at 23:57

zr0gravity7 · Accepted Answer · 2021-07-11T17:14:53.420

You will need a stack data structure. Iterate over the string and push the position of each opening tag onto the stack. When we encounter a closing tag, we assume either:

1) its name matches the name of the tag beginning at the position on the top of the stack
2) it is a self-closing tag

We also maintain a result list to save the parsed substrings.

For 1), we simply pop the position on the top of the stack and save the substring from this popped position up to the end of the closing tag to the result list.

For 2), we do not modify the stack, and only save the self-closing tag substring to the result list.

After encountering any tag (opening, closing, self-closing), we walk the iterator (a.k.a. current position pointer) forward by the length of that tag (from < to corresponding >).

If the html string sliced from the iterator onward does not match (from the beginning) any tag, then we simply walk the iterator forward by one (we crawl until we can again match a tag).

Here is my attempt:

import re

def split(html):
    if html == "":
        return []

    # Compile once; pattern.match(html, i) then avoids re-slicing the string on every step.
    openingTagPattern = re.compile(r"<([a-zA-Z]+)(?:\s[^>]*)*(?<!\/)>")
    closingTagPattern = re.compile(r"<\/([a-zA-Z]+).*?>")
    selfClosingTagPattern = re.compile(r"<([a-zA-Z]+).*?\/>")

    result = []
    stack = []  # start positions of opening tags that have not been closed yet

    i = 0
    while i < len(html):
        match = openingTagPattern.match(html, i)
        if match:  # opening tag
            stack.append(i)  # push position of start of opening tag onto stack
            i = match.end()
            continue

        match = closingTagPattern.match(html, i)
        if match:  # closing tag
            i = match.end()
            if stack:  # skip a closing tag that has no opening tag on the stack
                # pop position of start of corresponding opening tag from stack
                result.append(html[stack.pop():i])
            continue

        match = selfClosingTagPattern.match(html, i)
        if match:  # self-closing tag
            result.append(match[0])
            i = match.end()
            continue

        i += 1  # otherwise crawl until we can match a tag

    return result  # reached the end of the string
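
The order in which the three patterns are tried matters, because the self-closing pattern on its own can run past an opening tag until it finds the next "/>". As a small illustration (the helper name `classify` and the constant names are made up for this sketch; the patterns are the ones from the function above), this checks which kind of tag a fragment starts with, in the same order as the loop:

```python
import re

# The three patterns from the answer, under hypothetical constant names.
OPENING = r"<([a-zA-Z]+)(?:\s[^>]*)*(?<!\/)>"
CLOSING = r"<\/([a-zA-Z]+).*?>"
SELF_CLOSING = r"<([a-zA-Z]+).*?\/>"

def classify(fragment):
    """Return the kind of tag that `fragment` starts with, or None."""
    # Same order as in split(): opening, then closing, then self-closing.
    for kind, pattern in (
        ("opening", OPENING),
        ("closing", CLOSING),
        ("self-closing", SELF_CLOSING),
    ):
        if re.match(pattern, fragment):
            return kind
    return None

print(classify("<abc comfy>"))   # opening
print(classify("</abc>"))        # closing
print(classify("<room />"))      # self-closing
print(classify(" <br />"))       # None, because re.match anchors at the first character
print(classify("<b>text<br/>"))  # opening, since that pattern is tried first
```

Tried in isolation, `SELF_CLOSING` would match all of "<b>text<br/>", which is why the opening-tag check has to come first.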

Usage:

delimitedList = split("""<meta charset="utf-8" /><title> test123 </title><meta name="test" content="index,follow" /><meta name="description" content="Description" /><link rel="stylesheet" href="../layout/css/default.css" />""")

for item in delimitedList:
    print(item)

Output:

<meta charset="utf-8" />
<title> test123 </title>
<meta name="test" content="index,follow" />
<meta name="description" content="Description" />
<link rel="stylesheet" href="../layout/css/default.css" />
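
If only the outermost elements are wanted, as in the question's second example where `<room />` should stay inside `<abc comfy>...</abc>`, one option is to record a substring only when the stack is empty. The following is a minimal sketch of that idea, not a full HTML parser. The name `split_top_level` and the pattern constants are hypothetical, the patterns are slightly simplified from the ones above, and it assumes attribute values never contain ">" and that text between top-level elements can be dropped.

```python
import re

# Simplified tag patterns (an assumption of this sketch): no ">" inside attribute values.
OPENING = re.compile(r"<([a-zA-Z]+)(?:\s[^>]*)?(?<!/)>")
CLOSING = re.compile(r"</([a-zA-Z]+)[^>]*>")
SELF_CLOSING = re.compile(r"<([a-zA-Z]+)[^>]*/>")

def split_top_level(html):
    """Return the outermost elements of `html` as verbatim substrings."""
    result = []
    stack = []  # start offsets of elements that are currently open
    i = 0
    while i < len(html):
        match = OPENING.match(html, i)
        if match:
            stack.append(i)
            i = match.end()
            continue

        match = CLOSING.match(html, i)
        if match:
            i = match.end()
            if stack:
                start = stack.pop()
                if not stack:  # the element that just closed was an outermost one
                    result.append(html[start:i])
            continue

        match = SELF_CLOSING.match(html, i)
        if match:
            if not stack:  # keep self-closing tags only when nothing is open
                result.append(match.group(0))
            i = match.end()
            continue

        i += 1  # plain text or whitespace: move on by one character
    return result

print(split_top_level("<abc comfy><room /></abc> <br /> <abc/> "))
# ['<abc comfy><room /></abc>', '<br />', '<abc/>']
```

Like the function above, this does not check that a closing tag's name matches the opening tag on the stack, so badly nested input will give odd results; for real-world HTML a dedicated parser is the safer choice.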

References:

The openingTagPattern is inspired by @Kobi's answer here: https://stackoverflow.com/a/1732395/12109043

Hey, thank you zr0gravity7 for your answer. It is a very nice approach. I fixed two issues by adding an if to the self-closing tag branch: `if result == [] and html[start:i] != html[:i]: continue` — Nick Müller, Jul 12 '21 at 16:02
You're welcome! Not quite sure what you mean by the snippet in your comment, but I'm glad you were able to make it work. — zr0gravity7, Jul 12 '21 at 19:10
Got some problems with the formatting in the comment here :( — Nick Müller, Jul 12 '21 at 20:08
