Although others are right that this would be easier using DOM methods, if you can't manipulate the DOM and your HTML is effectively just a string, then you can do this (assuming C#):
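// Requires: using System; using System.Text.RegularExpressions;
string resultString = null;
try {
    // Replace each <object ...>...</object> block, plus the whitespace around it, with one space.
    resultString = Regex.Replace(subjectString,
        @"\s+<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*", " ", RegexOptions.IgnoreCase);
} catch (ArgumentException ex) {
    // Thrown if the pattern is invalid or subjectString is null; log or handle it here.
}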
This assumes that <object is the only part of the opening tag that is guaranteed not to change, and that the element is always closed with </object>.
EDIT: Explanation: The regex first matches one or more whitespace characters, then <object, then any run of characters that are not a closing angle bracket, followed by the closing angle bracket of the opening object tag. It then matches, as many times as possible, either any character that is not an opening angle bracket or an opening angle bracket that is not followed by /object (referred to via backreference \1). Finally, it matches </object> (using backreference \1 again) and any trailing whitespace. Everything that was matched is replaced with a single space.
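To make the walkthrough concrete, here is a minimal sketch that runs the same pattern over a small sample string. The class name ObjectTagDemo, the method name Strip, and the sample HTML are made up for illustration; only the pattern and the Regex.Replace call come from the answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; only the pattern comes from the answer.
public static class ObjectTagDemo
{
    public const string Pattern = @"\s+<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*";

    // Replaces every <object>...</object> block and its surrounding whitespace with one space.
    public static string Strip(string html)
    {
        return Regex.Replace(html, Pattern, " ", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Made-up sample: a newline precedes <object, and a nested <param> tag sits inside it.
        string html = "<p>before</p>\n"
                    + "<object data=\"movie.swf\"><param name=\"a\" value=\"b\"></object>\n"
                    + "<p>after</p>";

        // The nested <param> is consumed by the <(?!/\1) branch, because it is an
        // opening angle bracket that is not followed by /object.
        Console.WriteLine(Strip(html)); // prints: <p>before</p> <p>after</p>
    }
}
```

Note that the newline on each side of the block is swallowed by \s+ and \s*, which is why the two paragraphs end up separated by exactly one space.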
EDIT2: For efficiency, I used \s+ at the beginning of the regex, which means it will only match if there is at least one whitespace character (which can include a newline) before <object. However, if your original HTML could contain, say, xxx<object (e.g., the HTML string is minified), then change \s+ to \s*. Whether \s+ or \s* is faster depends on how well optimized the regex engine is in the .NET version and platform you're targeting, so experiment to find out which matches faster.
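As a rough sketch of that experiment, the snippet below shows the behavioural difference on a minified string and includes a simple Stopwatch loop for comparing the two variants. The names LeadingWhitespaceDemo, Strip, and TimeMs, the sample input, and the iteration count are all assumptions for illustration, and a loop like this is only a crude measurement, not a proper benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; the two patterns differ only in \s+ versus \s*.
public static class LeadingWhitespaceDemo
{
    private const string Body = @"<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*";
    public const string RequireWhitespace = @"\s+" + Body;
    public const string OptionalWhitespace = @"\s*" + Body;

    public static string Strip(string html, string pattern)
    {
        return Regex.Replace(html, pattern, " ", RegexOptions.IgnoreCase);
    }

    // Crude timing loop: elapsed milliseconds for the given number of replacements.
    public static long TimeMs(string html, string pattern, int iterations)
    {
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);
        var stopwatch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            regex.Replace(html, " ");
        }
        stopwatch.Stop();
        return stopwatch.ElapsedMilliseconds;
    }

    public static void Main()
    {
        // Made-up minified input: no whitespace before <object.
        string minified = "xxx<object data=\"a.swf\"></object>yyy";

        Console.WriteLine(Strip(minified, RequireWhitespace));  // input comes back unchanged
        Console.WriteLine(Strip(minified, OptionalWhitespace)); // prints: xxx yyy

        // Timings vary by machine and runtime, so no expected values are given.
        Console.WriteLine(TimeMs(minified, RequireWhitespace, 100000));
        Console.WriteLine(TimeMs(minified, OptionalWhitespace, 100000));
    }
}
```

For a meaningful comparison you would want to time it against HTML that resembles your real input rather than this tiny string.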
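EDIT3: The regex can be further simplified to \s+<(object)\b(?:[^<]|<(?!/\1))*</\1>\s* because the [^<] alternative already consumes the rest of the opening tag, including its closing angle bracket.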
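If you want to convince yourself that the simplified pattern behaves the same on your data, a quick comparison along these lines may help. The class name SimplifiedPatternCheck, the method SameResult, and the sample strings are invented for illustration. One difference to be aware of: the simplified form no longer requires the opening tag to have its own closing angle bracket, so the two can disagree on malformed input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical check for illustration; both patterns are taken from the answer.
public static class SimplifiedPatternCheck
{
    public const string Original   = @"\s+<(object)\b[^>]*>(?:[^<]|<(?!/\1))*</\1>\s*";
    public const string Simplified = @"\s+<(object)\b(?:[^<]|<(?!/\1))*</\1>\s*";

    // True when both patterns produce the same replacement result for the input.
    public static bool SameResult(string html)
    {
        string a = Regex.Replace(html, Original, " ", RegexOptions.IgnoreCase);
        string b = Regex.Replace(html, Simplified, " ", RegexOptions.IgnoreCase);
        return a == b;
    }

    public static void Main()
    {
        // Made-up, well-formed samples.
        string[] samples =
        {
            "<div> <object id=\"a\"><embed src=\"x\"></object> </div>",
            "a\n<OBJECT>text</OBJECT>\nb",
            "no object here"
        };

        foreach (string sample in samples)
        {
            Console.WriteLine(SameResult(sample)); // prints True for each sample above
        }
    }
}
```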