Modifying a group within Regular Expression Match

Question

So I have a function that is part of my Django (v1.5) model. It takes a body of text, finds all of my tags, such as `<note nickname>`, converts the ones meant for the current user to `<span>`, and removes all of the others.

The function below currently works, but it requires me to reassign note_tags = '<note .*?>.*?</span>\r\n' partway through, because group 0 of the match finds all of the tags regardless of whether the user's nickname is in there. So I'm curious how I would use the groups to remove all of the unneeded tags without having to modify the regex.

def format_for_user(self, user):
    body = self.body
    note_tags = '<note .*?>.*?</note>\r\n'
    user_msg = False
    if not user is None:
        user_tags = '(<note %s>).*?</note>' % user.nickname
        user_tags = re.compile(user_tags)
        for tag in user_tags.finditer(body):
            if tag.groups(1):
                replacement = str(tag.groups(1)[0])
                body = body.replace(replacement, '<span>')
                replacement = str(tag.group(0)[-7:])
                body = body.replace(replacement, '</span>')
                user_msg = True
                note_tags = '<note .*?>.*?</span>\r\n'
    note_tags = re.compile(note_tags)
    for tag in note_tags.finditer(body):
        body = body.replace(tag.group(0), '')
    return (body, user_msg)

Is there a reason you're [using `re` to parse your HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) instead of an actual HTML library like `BeautifulSoup`? Not that it's necessarily impossible for what you want to do, but given that this would be trivial with an HTML library, and you don't know how to write the regexp and have to do clumsy things like stripping off the first 7 characters of a string and your code has a bug in it because you're using `str.replace` on something that may occur more than once and so on… — abarnert, Sep 20 '14 at 05:14
Didn't realize there was an alternative. Will check out Beautiful Soup. — badisa, Sep 20 '14 at 19:31

score 0 · Accepted Answer · answered Sep 29 '14 at 03:32

abarnert was correct: I shouldn't be using regex to parse my HTML, and should use a library such as BeautifulSoup instead.

I switched to BeautifulSoup. The resulting code is below, and it solves a lot of the problems the regex version had.

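from bs4 import BeautifulSoup

def format_for_user(self, user):
    # Name a parser explicitly; html.parser does not wrap the fragment in <html>/<body>.
    soup = BeautifulSoup(self.body, 'html.parser')
    user_msg = False
    if user is not None:
        # Notes whose class is this user's nickname become <span> tags.
        for tag in soup.find_all('note', {'class': user.nickname}):
            tag.name = 'span'
            user_msg = True
    # Anything still named <note> belongs to someone else, so remove it.
    for tag in soup.find_all('note'):
        tag.decompose()
    return (soup.prettify(), user_msg)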
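
As a side note on the original question about groups: if you do want to stay with `re`, one way to avoid changing the pattern is to capture the nickname and the note text as groups and decide per match inside a replacement function passed to `re.sub`. This is only a minimal sketch, assuming notes look like `<note nickname>text</note>` as in the question and never nest. `NOTE_RE` and `format_for_nickname` are hypothetical names, and the function takes the nickname string directly rather than a user object. It is still brittle compared with an HTML parser.

```python
import re

# Hypothetical pattern: group 1 is the nickname, group 2 the note text,
# group 3 an optional trailing line break. DOTALL lets a note span lines.
NOTE_RE = re.compile(r'<note ([^>]*)>(.*?)</note>(\r?\n)?', re.DOTALL)


def format_for_nickname(body, nickname):
    """Return (new_body, user_msg); nickname may be None."""
    found = []

    def replace(match):
        if nickname is not None and match.group(1) == nickname:
            found.append(True)
            # Keep the text and any trailing line break, but swap the tags.
            return '<span>%s</span>%s' % (match.group(2), match.group(3) or '')
        # Any other note is dropped together with its trailing line break.
        return ''

    return NOTE_RE.sub(replace, body), bool(found)


body = 'Hi\r\n<note alice>for alice</note>\r\n<note bob>for bob</note>\r\nBye'
print(format_for_nickname(body, 'alice'))
```

Because each match is replaced in place, this also sidesteps the `str.replace` problem mentioned in the comments, where a replacement string that occurs more than once in the body gets changed everywhere.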
