
I am parsing RSS content using Universal Feed Parser. In the description tag, I sometimes get values like the ones below:

<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<m:Table>Sampe Text</m:Table>

In order to remove the HTML elements/tags, I am using the following regex:

pattern = re.compile(r'</?\w+\s*[^>]*?/?>', re.DOTALL | re.MULTILINE | re.IGNORECASE | re.UNICODE)
desc = pattern.sub(u" ", desc)

This helps remove the HTML tags but not the XML comments. How do I remove both the elements and the XML comments?

Simsons
  • Wouldn't this be enough? `r'<.*?>'` – rplnt Oct 12 '11 at 11:47
  • The proper way to do this would be to use an XML parser Like @duffymo said. Try [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) – WilHall Oct 12 '11 at 12:00
  • A parser is an overkill in this case. You don't need to know the tree structure, tag namespace, name, and attributes only to throw them away, do you? Oh, and @rplnt, you forgot about the CDATA (`<![CDATA[some text some more text]]>`). – pyos Oct 12 '11 at 12:03

4 Answers

5

Using lxml:

import lxml.html as LH

content='''
<!--This is the XML comment -->
<p>This is a Test Paragraph</p></br>
<b>Sample Bold</b>
<Table>Sampe Text</Table>
'''

doc=LH.fromstring(content)
print(doc.text_content())

yields

This is a Test Paragraph
Sample Bold
Sampe Text
unutbu
4

Using regular expressions this way is a bad idea.

I'd navigate the DOM tree after using a real parser and remove what I wanted that way.
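For a standard-library take on that approach, Python's own HTML parser hands you just the text nodes; tags and comments never reach `handle_data`, so nothing has to be matched or stripped by hand. A minimal sketch using Python 3's `html.parser` (the class and function names here are illustrative, not part of any API):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes only; tags and comments are simply ignored."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Fires for text content only, never for markup or comments.
        self.parts.append(data)

def strip_markup(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return ''.join(extractor.parts)

print(strip_markup('<!--This is the XML comment --><p>This is a Test Paragraph</p>'))
# → This is a Test Paragraph
```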

duffymo
  • As per the accepted answer here http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags. Use Beautiful Soup instead. – yann.kmm Oct 12 '11 at 11:49
  • You guys from Ban Regex Movement are really freaking me out. Regex cannot be used to **PARSE** XML because tags can be nested (``) but they can be used to **STRIP** tags 'cause a tag is simply anything between angle brackets. Read Wikipedia, dammit. (Sorry.) – pyos Oct 12 '11 at 11:51
  • There is no movement to ban regexp, it's just to point out that the correct tools should be used for each task, and before stripping out a tag you have to find it, and how would you do that? with a regexp? Bad idea. – yann.kmm Oct 12 '11 at 11:56
  • So why is it bad then, exactly? – pyos Oct 12 '11 at 11:57
  • Because the DOM tree has more context, it gives you element type information, and it has a good API (XPath) for finding things. – duffymo Oct 12 '11 at 12:06
  • @duffymo do you really need to know that context to *remove* elements? Did you even care to read the question? :-/ – pyos Oct 12 '11 at 12:08
  • I did, I just disagree with you. I'm not trying to ban anything. You and everyone else who would prefer to use RE are free to do so. Context, such as knowing that a particular Element in the DOM tree is a COMMENT, would certainly seem germane to me. I can remove all of them in one shot if I wish. Do you even know what a DOM tree is? (Sorry, if you're going to be snarky you should not be upset when someone else returns the favor.) – duffymo Oct 12 '11 at 12:10
  • You can know that the element is a comment by matching it with ``. The only thing you *can't* parse in XML with regex is the tree structure (the tag's position in it, to be exact), but you don't need it anyway. Btw, if you look at HTMLParser in the Python standard library, you'll find out that it uses regex internally. (And, uh, please forgive me for being rude. ._.) – pyos Oct 12 '11 at 12:18
  • Perhaps you don't need it, but you can't deny that it's useful to already have the type information without having to match anything. That's my argument. If the parser is using regex internally, that's fine with me. I'm saying that they know regular expressions better than most people, and their code is tested against a much wider audience. Why re-invent it? You're making my case. And you are forgiven. No worries. – duffymo Oct 12 '11 at 12:26
1

There's a simple way to do this with pure Python:

def remove_html_markup(s):
    tag = False
    quote = False
    out = ""

    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out = out + c

    return out
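A quick check against the markup from the question (the scanner is restated here so the snippet runs on its own; note it assumes quotes occur only inside tags, so a stray apostrophe inside a comment would confuse it):

```python
def remove_html_markup(s):
    # Same character scanner as above, restated so this snippet runs standalone.
    tag = quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:
            tag = True
        elif c == '>' and not quote:
            tag = False
        elif (c == '"' or c == "'") and tag:
            quote = not quote
        elif not tag:
            out += c
    return out

sample = '<!--This is the XML comment --><p>This is a Test Paragraph</p>'
print(remove_html_markup(sample))  # → This is a Test Paragraph
```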

The idea is explained here: http://youtu.be/2tu9LTDujbw

You can see it working here: http://youtu.be/HPkNPcYed9M?t=35s

PS - If you're interested in the class (about smart debugging with Python), here's the link: http://www.udacity.com/overview/Course/cs259/CourseRev/1. It's free!

You're welcome!

Igor Medeiros
0

Why so complex? `re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>', lambda m: m.group(1) or '', desc, flags=re.DOTALL)` (use a raw string so the backslashes survive; the `flags=` keyword needs Python 2.7+).

If you want XML tags intact, you should probably check out a list of HTML tags at http://www.whatwg.org/specs/web-apps/current-work/multipage/ and use the '(<!\[CDATA\[.*?\]\]>)|<!--.*?-->|</?(?:tag names separated by pipes)(?:\s.*?)?>' regex.
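Applied to the sample from the question, the one-liner behaves like this (a sketch; note that `<.*?>` will misfire on a comment that itself contains a `>`):

```python
import re

desc = ('<!--This is the XML comment -->'
        '<p>This is a Test Paragraph</p>'
        '<![CDATA[kept text]]>')

# CDATA sections are replaced by their contents (group 1);
# everything else between angle brackets, comments included, is dropped.
text = re.sub(r'<!\[CDATA\[(.*?)\]\]>|<.*?>',
              lambda m: m.group(1) or '', desc, flags=re.DOTALL)
print(text)  # → This is a Test Paragraphkept text
```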

pyos