How can I replace the HTML inside a pre tag? I would prefer to do that with a regex.

<html>
<head></head>
<body>
<div>
<pre>

    <html>
    <body>
    -----> hello! ----< 
    </body>
    </html

</pre>
</div>
</body>
Vadim Kotov
Jordon Willis
  • Somehow, this sounds like a bad idea. If one could tell what the question really means. – Tim Pietzcker Feb 16 '11 at 12:13
  • For starters, that doesn't even seem like valid HTML. – R. Martinho Fernandes Feb 16 '11 at 12:14
  • You do not say what the result should look like or what the input looks like (your makeshift sample probably does not reflect reality), where it comes from and why you want to do it with regex. In this form, this is hardly a question. – Tomalak Feb 16 '11 at 12:15
  • I agree.. what's the question? You want to change the text within the `<pre>` and `</pre>`? – Theun Arbeider Feb 16 '11 at 12:15
  • You're best off telling us a bit more background so we can supply a decent solution – m.edmondson Feb 16 '11 at 12:15
  • @Levisaxos - Yes, I need to change all HTML tags within the `<pre>` tag. – Jordon Willis Feb 16 '11 at 12:29
  • Where does the HTML come from? That is, is this the markup of a page, or data from a database that needs to be inserted into an ASP.NET page? Where is the code running? That is, a GUI or an ASP.NET application? Some more background information will help us understand your question better and what difficulties you've had. – Shiv Kumar Feb 16 '11 at 12:30
  • Just a thought but hasn't regex parsing of html already been shown to be a bad idea? See here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – bic Feb 16 '11 at 13:18

2 Answers

EDIT: As indicated by another answer, regex does not support HTML or XHTML completely, and so you will be better off using an HTML parser instead. I'm leaving my answer here for reference though.

What do you want to replace the content inside the pre-tags with?

I'm not familiar with the specific C# syntax, but provided C# uses Perl-style regexes, the following PHP snippet might be helpful. The code below replaces the pre element (tags and content) with the string "(pre tag content was here)" (just tested with the command line PHP client):

<?php
$html = "<html><head></head><body><div><pre class=\"some-css-class\">
         <html><body>
         -----> hello! ----< 
         </body></html
         </pre></div></body>"; // Compacting things here, for brevity

$newHTML = preg_replace("/(.*?)<pre[^<>]*>(.*?)<\/pre>(.*)/Us", "$1(pre tag content was here)$3", $html);
echo $newHTML;
?>

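On its own, the ? mark makes the matching non-greedy (stop at the first occurrence of what comes after), and the s modifier specifies "single-line support", which is important to make . match newlines too. Note that the uppercase U modifier is not "Unicode-character-support" (that is lowercase u); in PCRE it is the ungreedy flag, which inverts the greediness of every quantifier, so with U set .*? is actually the greedy one and .* the lazy one. The [^<>]* part is for supporting attributes in the pre tag, such as <pre class="some-css-class"> (it will match any number of characters except for < or >).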
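To make the effect of the non-greedy ? and of single-line mode concrete, here is a minimal C# sketch. The class and method names (QuantifierDemo, Capture) are made up for illustration, and it only uses the pre-matching part of the pattern, so treat it as a demonstration rather than a finished solution:

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
    // Hypothetical helper: returns what group 1 captured, or null if the pattern did not match.
    public static string Capture(string html, string pattern, RegexOptions options)
    {
        Match m = Regex.Match(html, pattern, options);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string two = "<pre>a</pre><p>x</p><pre>b</pre>";

        // Lazy: stops at the first </pre>.
        Console.WriteLine(Capture(two, @"<pre[^<>]*>(.*?)</pre>", RegexOptions.None));
        // Greedy: runs on to the last </pre>, swallowing everything in between.
        Console.WriteLine(Capture(two, @"<pre[^<>]*>(.*)</pre>", RegexOptions.None));

        string multiLine = "<pre class=\"c\">\nhello\n</pre>";

        // Without Singleline, '.' does not match '\n', so there is no match at all.
        Console.WriteLine(Capture(multiLine, @"<pre[^<>]*>(.*?)</pre>", RegexOptions.None) == null);
        // With Singleline, '.' also matches newlines and the content is captured.
        Console.WriteLine(Capture(multiLine, @"<pre[^<>]*>(.*?)</pre>", RegexOptions.Singleline) == "\nhello\n");
    }
}
```

The first two lines printed should be "a" and "a</pre><p>x</p><pre>b", which is why the lazy form matters as soon as a page has more than one pre block.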

UPDATE: As indicated by Martinho Fernandes in the comments below, the C# syntax for the above regex should be something like:

new Regex(@"(.*?)<pre[^<>]*>(.*?)<\/pre>(.*)", RegexOptions.Singleline)
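For anyone who wants to try that in C#, the following is a rough sketch of how the pattern could be applied with Regex.Replace. The names PreTagDemo and ReplaceFirstPre are hypothetical, the input is a compacted version of the sample from the question, and the usual caveat applies: this only copes with simple markup, not arbitrary HTML.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PreTagDemo
{
    // Same pattern as above; Singleline lets '.' match newlines.
    private static readonly Regex PrePattern = new Regex(
        @"(.*?)<pre[^<>]*>(.*?)<\/pre>(.*)", RegexOptions.Singleline);

    // Hypothetical helper: swaps the first <pre ...>...</pre> element for a placeholder.
    public static string ReplaceFirstPre(string html, string placeholder)
    {
        // Double any '$' so the placeholder is not read as a group reference.
        string safe = placeholder.Replace("$", "$$");
        // ${1} is the text before the pre element, ${3} the text after it.
        return PrePattern.Replace(html, "${1}" + safe + "${3}");
    }

    public static void Main()
    {
        string html = "<div><pre class=\"some-css-class\">\n<html><body>\n-----> hello! ----<\n</body></html\n</pre></div>";
        // Prints: <div>(pre tag content was here)</div>
        Console.WriteLine(ReplaceFirstPre(html, "(pre tag content was here)"));
    }
}
```

Because the leading (.*?) and trailing (.*) groups consume the whole input, a single match covers the entire string and only the first pre block is replaced. If several blocks need replacing, dropping those two groups and replacing just `<pre[^<>]*>.*?<\/pre>` is one option.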
Community
Samuel Lampa
  • Works for the example. Fails for `
    foo
    `. Don't know if it matters for the OP, though.
    – R. Martinho Fernandes Feb 16 '11 at 12:39
  • Ah, true. Will modify my answer. – Samuel Lampa Feb 16 '11 at 12:40
  • @Martinho & Samuel: Simply remove the `\>` after `<pre`. – Theun Arbeider Feb 16 '11 at 12:43
  • @Levisaxos: True, that would work, but would be a little bit ambiguous (in case of any other tags starting with "pre"). Updated with a solution that allows any number of non-<> chars in the pre tag. – Samuel Lampa Feb 16 '11 at 12:47
  • `
    FAIL"">
    `. Seriously, [stop trying](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454).
    – R. Martinho Fernandes Feb 16 '11 at 12:53
  • Updated the modifiers now. `/mUs` should have been `/Us`. `s` was the important thing, as it makes `.` match newlines, which is the more convenient way to do things (otherwise one probably has to include stuff like `^` and `$` for line starts and endings...). – Samuel Lampa Feb 16 '11 at 13:00
  • Oops, in the last comment I meant `
    `
    – R. Martinho Fernandes Feb 16 '11 at 13:02
  • @Martinho Fernandes: True, it won't support complete (X)HTML. Still, for many everyday cases, I found this type of regex very useful (you don't always have the time to take a full-blown XML/HTML parser into use). – Samuel Lampa Feb 16 '11 at 13:17
  • In any case, the regex you supplied still isn't valid C# Regex code. When I tried it, I got several errors or just no results. – Theun Arbeider Feb 16 '11 at 14:13
  • @Levisaxos: it works in C# if you translate the syntax by removing the slashes and the options at the end (which you will need to supply to the Regex ctor with `RegexOptions`), leaving you with `new Regex(@"(.*?)<pre[^<>]*>(.*?)<\/pre>(.*)", RegexOptions.Singleline)` (.NET regexes/strings are Unicode already). – R. Martinho Fernandes Feb 16 '11 at 14:35
  • @Samuel: it depends on where the HTML comes from. The OP's example doesn't seem to come from a sanitized source because it isn't even valid HTML. In addition, the OP wants to do DOM manipulation (replace the contents of a tag), while regexes are a tool for text manipulation. I don't see why using an HTML parser would be more time-consuming than coming up with an adequate regex. – R. Martinho Fernandes Feb 16 '11 at 14:52
  • @Martinho Fernandes: Indeed, probably true in this case. – Samuel Lampa Feb 16 '11 at 15:27
RegEx match open tags except XHTML self-contained tags

Thank you, Martinho Fernandes.

Community
Theun Arbeider