Regular expressions to remove a specific HTML tag

Question

Hi. First, sorry for my English.

I need to remove one specific HTML tag, not all tags.

This is the tag I want to remove:

xxx

<object data="/dictionary/flash/SpeakerApp16.swf" type="application/x-shockwave-flash" width=" 16" height="16" id="pronunciation"> <param name="movie" value="/dictionary/flash/SpeakerApp16.swf"><param name="flashvars" value="sound_name=http%3A%2F%2Fwww.gstatic.com%2Fdictionary%2Fstatic%2Fsounds%2Fde%2F0%2Fman.mp3"><param name="wmode" value="transparent"><a href="http://www.gstatic.com/dictionary/static/sounds/de/0/man.mp3"><img border="0" width="16" height="16" src="/dictionary/flash/SpeakerOffA16.png" alt="listen"></a> </object>

yyy

I want the result to be: xxx yyy

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Dec 21 '10 at 20:26

score 1 · Answer 1 · answered Dec 21 '10 at 20:26

If you know exactly what the tag will be, a non-regex search and replace will be faster and more efficient. How much do you know of the tag's form?

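Also, parsing HTML with regex is a Bad Thing: HTML is not a regular language, so patterns tend to break on nesting and unusual markup.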
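
If the tag really is a fixed, known string, the non-regex approach can be as small as one String.Replace call. This is only a minimal sketch, assuming the markup never varies; `knownTag` and `page` are hypothetical names, and the tag is shortened for illustration. It is written as C# top-level statements (C# 9 or later).

```csharp
using System;

// Hypothetical, shortened stand-in for the exact tag text known in advance.
string knownTag = "<object data=\"/dictionary/flash/SpeakerApp16.swf\" id=\"pronunciation\"></object>";
string page = "xxx " + knownTag + "yyy";

// String.Replace(string, string) does an ordinal (exact, case-sensitive) match,
// so any change in attributes, spacing, or letter case means nothing is replaced.
string cleaned = page.Replace(knownTag, "");

Console.WriteLine(cleaned); // prints "xxx yyy"
```

As Uwe Keim's comment below points out, this stops working as soon as any attribute differs from the stored string.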

ssube


A tag usually has attributes, so just a string replace will likely be insufficient. – Uwe Keim Dec 21 '10 at 20:28
The HTML is in a string. I tried to convert this text to an HtmlDocument but could not; the conversion seems to require WebBrowser webControl = new WebBrowser(); webControl.DocumentText = string_txt; but that gave me a security error message. – bebo Dec 21 '10 at 20:30
@Uwe Keim: that's why I was asking how much was known. If neither the tag nor its attributes change (which is possible, since the whole thing was posted in the question), then it isn't a worry: it's a known, constant string. – ssube Dec 22 '10 at 03:45
@user545142: it sounds like you can't retrieve the text from the browser. Check what privileges that bit of code is running with and what the specific error is. – ssube Dec 22 '10 at 03:47

Jaifroid · Answer 2 · 2018-05-08T06:40:19.003

Although others are right that this would be easier using DOM methods, if you can't manipulate the DOM and your HTML is effectively just a string, then you can do this (assuming C#):

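using System;
using System.Text.RegularExpressions;

string resultString = null;
try {
    // Replace the whole <object>...</object> element, plus surrounding whitespace, with one space.
    resultString = Regex.Replace(subjectString,
        @"\s+<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*", " ", RegexOptions.IgnoreCase);
} catch (ArgumentException) {
    // Thrown if the pattern cannot be parsed or subjectString is null; handle or log it here.
}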

This assumes that <object is the only part of the opening tag guaranteed to stay constant (the attributes may vary) and that the element is always closed with </object>.

EDIT: Explanation: The regex first matches one or more whitespace characters, then <object followed by a word boundary. [^>]*> then consumes the rest of the opening tag, up to and including its closing angle bracket. Next, (?:[^<]|<(?!/\1))* repeatedly matches either any character that is not an opening angle bracket, or an opening angle bracket that is not followed by /object (referred to via backreference \1), as many times as possible. Finally it matches </object> (using backreference \1 again) and any trailing whitespace. Everything matched is replaced with a single space.
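
To make the walkthrough concrete, here is a minimal sketch that applies the same pattern to a shortened version of the question's markup. The attributes are trimmed for readability and the variable names are only illustrative; it is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened stand-in for the question's markup (several attributes and params omitted).
string subjectString =
    "xxx\n\n<object data=\"/dictionary/flash/SpeakerApp16.swf\" id=\"pronunciation\"> " +
    "<param name=\"wmode\" value=\"transparent\">" +
    "<a href=\"http://www.gstatic.com/dictionary/static/sounds/de/0/man.mp3\">" +
    "<img alt=\"listen\"></a> </object>\n\nyyy";

// Same pattern as in the answer: the inner tags (<param>, <a>, </a>) are consumed
// because none of them starts with "</object".
string pattern = @"\s+<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*";

string resultString = Regex.Replace(subjectString, pattern, " ", RegexOptions.IgnoreCase);

Console.WriteLine(resultString); // prints "xxx yyy"
```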

EDIT2: For efficiency, I used \s+ at the beginning of the regex, which means it will only match if there is at least one whitespace character (which can include a newline) before <object. However, if your original HTML could contain, say, xxx<object (e.g., the HTML string is minified), then change \s+ to \s*. Whether \s+ or \s* is faster depends on how well optimized the regex engine is in the .NET version and platform you're targeting, so measure both to find out.
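
The difference between the two quantifiers can be seen on a minified string. This is a sketch with a hypothetical input (`minified`) and illustrative variable names, written as C# top-level statements (C# 9 or later); it shows matching behaviour only and says nothing about speed.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical minified input: no whitespace before <object or after </object>.
string minified = "xxx<object id=\"pronunciation\"><param name=\"wmode\"></object>yyy";

string strict  = @"\s+<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*"; // needs leading whitespace
string relaxed = @"\s*<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*"; // leading whitespace optional

string withPlus = Regex.Replace(minified, strict, " ", RegexOptions.IgnoreCase);
string withStar = Regex.Replace(minified, relaxed, " ", RegexOptions.IgnoreCase);

Console.WriteLine(withPlus == minified); // prints "True": \s+ finds nothing to replace
Console.WriteLine(withStar);             // prints "xxx yyy"
```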

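EDIT3: The regex can be further simplified to this: \s+<(object)\b(?:[^<]|<(?!/\1))*</\1>\s*. This works because [^<] also matches the attributes and the closing > of the opening tag, so the separate [^>]*> part is not needed.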

score 1 · Answer 3 · answered May 08 '18 at 06:46

Why use regex when you can simply use IndexOf?

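using System;

string html = "...";
int start;
// Repeat while an opening <object tag is still present (case-insensitive, ordinal search).
while ((start = html.IndexOf("<object", StringComparison.OrdinalIgnoreCase)) >= 0)
{
    // Search for the closing tag only after the opening tag.
    int end = html.IndexOf("</object>", start, StringComparison.OrdinalIgnoreCase);
    if (end < 0) break; // unclosed tag: stop instead of passing a negative count to Remove
    html = html.Remove(start, end - start + "</object>".Length);
}
// now 'html' contains the html without object tags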

Explanation:

Find the first occurrence of <object
Find the start of the next closing tag
Remove that part including the whole closing tag
Repeat until no object tags are left
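
The steps above can be wrapped in a small helper. This is only a sketch: `RemoveObjectTags` is a hypothetical name, the input is a shortened version of the question's markup, and the code is written as C# top-level statements (C# 9 or later). Unlike the regex answer, the IndexOf loop leaves the whitespace around the tag in place, so the sketch collapses it separately to reach the "xxx yyy" the question asks for.

```csharp
using System;

// Hypothetical helper implementing the four steps listed above.
static string RemoveObjectTags(string html)
{
    const string openTag = "<object";
    const string closeTag = "</object>";
    int start;
    while ((start = html.IndexOf(openTag, StringComparison.OrdinalIgnoreCase)) >= 0)
    {
        int end = html.IndexOf(closeTag, start, StringComparison.OrdinalIgnoreCase);
        if (end < 0) break; // no closing tag: leave the remainder untouched
        html = html.Remove(start, end - start + closeTag.Length);
    }
    return html;
}

string input = "xxx\n\n<object id=\"pronunciation\"><param name=\"wmode\"></object>\n\nyyy";
string stripped = RemoveObjectTags(input);

// Collapse the leftover runs of whitespace into single spaces.
string collapsed = string.Join(" ",
    stripped.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries));

Console.WriteLine(collapsed); // prints "xxx yyy"
```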
