I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is getting my regex search to find all the possible matches and then store them in a tuple.
Let's say I have a page with the following stored in the variable HTMLtext:
<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>
I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:
pages = ["home", "about", "music", "photos", "stuff", "contact"]
So far, I'm able to use regex to search for one result:
pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]
Running this expression makes pages = ['home'].
How can I get the regex search to continue for the whole text, appending the matched text to this tuple?
(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)
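For reference, one possible approach is re.findall, which returns every non-overlapping match instead of stopping at the first one. The following is only a minimal sketch: it assumes HTMLtext holds exactly the snippet above, and the name link_pattern and the narrower character class [^"]* (used in place of .* so a match cannot run past the closing quote) are choices made for illustration, not part of the original attempt.

```python
import re

# Assumption: HTMLtext contains the snippet shown in the question.
HTMLtext = """<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>"""

# [^"]* matches everything up to the closing quote of the href value.
link_pattern = re.compile(r'<a href="/blog/([^"]*)">')

# With a single capture group, findall returns a list of the captured strings.
pages = link_pattern.findall(HTMLtext)
print(pages)  # ['home', 'about', 'music', 'photos', 'stuff', 'contact']

# If an actual tuple is wanted, the list can be converted afterwards.
pages_tuple = tuple(pages)
```

re.finditer would be an alternative if the match objects themselves (positions, multiple groups) are needed rather than just the captured text.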