2

Say I have the following HTML string:

<head>

</head>

<body>
<img src="stickman.gif" width="24" height="39" alt="Stickman">
<a href="http://www.w3schools.com">W3Schools</a>
</body> 

I want to add a string between the <head> tags, so that the final HTML string becomes:

<head>
<base href="http://www.w3schools.com/images/">
</head>

<body>
<img src="stickman.gif" width="24" height="39" alt="Stickman">
<a href="http://www.w3schools.com">W3Schools</a>
</body> 

So I have to search for the first occurrence of the <head> string then insert <base href="http://www.w3schools.com/images/"> right after.

How do I do this in C#?

PutraKg
  • You can do this in more ways than one: regular expressions; splitting your text on a certain character and writing the pieces back out together with the line you need to add; or using an XmlReader/XmlWriter. – Ken de Jong May 13 '13 at 08:10
  • I don't mind using Regex – PutraKg May 13 '13 at 08:13
  • 2
    Regex is really overkill for what you want to do here. Simple .NET string manipulation is good enough and a lot less complex. – pyrocumulus May 13 '13 at 08:18
  • Don't use RegEx when manipulating HTML/XML, because HTML is not *regular*, and RegEx is for manipulating Regular Expressions. – abelenky May 13 '13 at 08:25
  • @abelenky: Since when is RegEx for manipulating Regular Expressions? – Mesh May 13 '13 at 08:39
  • Correction: RegEx is for manipulating Regular Languages, which HTML is not. (http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html) – abelenky May 13 '13 at 08:52
  • What you want to do is too simple for a regex but, in any case, regex is the wrong tool for the job. Have you seen this answer? It got a couple of votes: http://stackoverflow.com/a/1732454/659190. What if the head tag is ` `? – Jodrell May 13 '13 at 09:23
  • possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Jodrell May 13 '13 at 09:29
  • I do not know about XHTML self-contained tags; perhaps you are misunderstanding my question, as the two are unrelated. My question is not directly related to regex. The one you suggested is overkill for my needs. – PutraKg May 13 '13 at 12:17

4 Answers

7

So why not just do something easy like:

myHtmlString = myHtmlString.Replace("<head>", "<head><base href=\"http://www.w3schools.com/images/\">");

It's not the most elegant or extensible approach, but it satisfies the conditions of your question.
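
A minimal sketch of the `Replace` approach, not part of the original answer. It assumes a sample document with a doctype (the `html` and `upper` strings are made up for illustration) and shows two things worth knowing: `Replace` returns a new string, so the result has to be kept, and the search is an exact, case-sensitive match, so variants of the opening tag are silently left alone.

```csharp
using System;

// Sample input, assumed for illustration only.
string html = "<!DOCTYPE html><html><head></head><body></body></html>";
const string baseTag = "<base href=\"http://www.w3schools.com/images/\">";

// Strings are immutable: Replace returns a new string and leaves html unchanged.
string patched = html.Replace("<head>", "<head>" + baseTag);
Console.WriteLine(patched);

// The match is exact and case-sensitive, so an upper-case tag is not found.
string upper = "<HEAD></HEAD>";
string unchanged = upper.Replace("<head>", "<head>" + baseTag);
Console.WriteLine(unchanged == upper); // prints True
```

Text before the tag, such as a doctype, does not affect the match. A tag written as `<HEAD>` or carrying attributes does, which may explain failures on real pages.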

dav_i
  • For some reason it does not find the `<head>` tag if there's some other string before it – PutraKg May 13 '13 at 08:25
  • The above doesn't care if there is something before or after it. There must be something else going on if it's not working. – dav_i May 13 '13 at 08:58
  • @PutraKg, your question doesn't have any text before the head tag. Is this question about all html? http://stackoverflow.com/a/1732454/659190 – Jodrell May 13 '13 at 09:27
  • The question shows just one example. I am testing the answer against real websites too. Sorry for not mentioning that; I forgot to consider `doctype` etc. when writing the question. – PutraKg May 13 '13 at 12:13
5

Another way of doing this:

string html = "<head></head><body><img src=\"stickman.gif\" width=\"24\" height=\"39\" alt=\"Stickman\"><a href=\"http://www.w3schools.com\">W3Schools</a></body>";
var index = html.IndexOf("<head>");

if (index >= 0)
{
     html = html.Insert(index + "<head>".Length, "<base href=\"http://www.w3schools.com/images/\">");
}
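
For real pages, the same `IndexOf`/`Insert` idea can be made more tolerant. The following is a rough sketch under stated assumptions, not part of the original answer: the helper name `InsertAfterHeadOpen` and the sample page are hypothetical. It searches case-insensitively, accepts attributes on the tag, and skips look-alike tags such as `<header>`.

```csharp
using System;

// Hypothetical helper: inserts snippet right after the first opening <head ...> tag.
static string InsertAfterHeadOpen(string html, string snippet)
{
    const string open = "<head";
    int start = html.IndexOf(open, StringComparison.OrdinalIgnoreCase);
    while (start >= 0)
    {
        int next = start + open.Length;
        // Accept "<head>" or "<head ...>", but skip tags such as "<header>".
        if (next < html.Length && (html[next] == '>' || char.IsWhiteSpace(html[next])))
        {
            int close = html.IndexOf('>', next);
            if (close < 0) return html;            // malformed tag: leave input unchanged
            return html.Insert(close + 1, snippet);
        }
        start = html.IndexOf(open, next, StringComparison.OrdinalIgnoreCase);
    }
    return html;                                   // no <head> tag found
}

// Sample input, assumed for illustration only.
string page = "<!DOCTYPE html><html><HEAD lang=\"en\"><title>t</title></HEAD><body></body></html>";
Console.WriteLine(InsertAfterHeadOpen(page, "<base href=\"http://www.w3schools.com/images/\">"));
```

This is still plain string searching: a `>` inside an attribute value, or a `<head` inside a comment or script, would fool it.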
gzaxx
2

This is how it can be done with Regex, if you prefer to use it:

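// Requires: using System.Text.RegularExpressions;
public string ReplaceHead(string html)
{
    // Group 1 is the opening tag (with any attributes), group 2 the head's content.
    string rx = @"(<head\b[^>]*>)(.*?)</head>";
    Regex r = new Regex(rx, RegexOptions.IgnoreCase | RegexOptions.Singleline);
    Match m = r.Match(html);
    if (!m.Success)
    {
        return html; // no head element found; leave the input unchanged
    }
    // Insert the base tag right after the opening <head> tag.
    int insertAt = m.Groups[1].Index + m.Groups[1].Length;
    return html.Insert(insertAt, "<base href=\"http://www.w3schools.com/images/\">");
}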
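
One reason to match the whole head element is that its content can be inspected before anything is inserted. A hedged sketch of that idea follows; the helper name `EnsureBase` and the sample input are hypothetical, and this is not code from the original answer. It adds the tag only when the head does not already contain a `<base>` element.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: adds baseTag only if the head has no <base> element yet.
static string EnsureBase(string html, string baseTag)
{
    // Group 1: opening tag with any attributes. Group 2: everything up to </head>.
    Match m = Regex.Match(html, @"(<head\b[^>]*>)(.*?)</head>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
    if (!m.Success) return html;                   // no head element found

    string headContent = m.Groups[2].Value;
    if (Regex.IsMatch(headContent, @"<base\b", RegexOptions.IgnoreCase))
        return html;                               // already has a base tag

    Group open = m.Groups[1];
    return html.Insert(open.Index + open.Length, baseTag);
}

// Sample input, assumed for illustration only.
string doc = "<html><head>\n<title>t</title>\n</head><body></body></html>";
string once = EnsureBase(doc, "<base href=\"http://www.w3schools.com/images/\">");
string twice = EnsureBase(once, "<base href=\"http://www.w3schools.com/images/\">");
Console.WriteLine(once == twice); // prints True
```

`RegexOptions.Singleline` lets `.` match newlines, so a head that spans several lines is still captured.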
Draykos
  • No, it's not. The question was not "which is the easiest way to do it" but "how can it be done with regex" (though it has now been edited). Anyway, it can be useful to check what you have inside the tag before doing the replace, which is why I use a similar solution in a project of mine – Draykos May 13 '13 at 08:48
  • I would use this, but some people suggested that regex would probably be overkill in my situation, so I edited my question and removed the regex preference. Anyway, I appreciate your answer. – PutraKg May 13 '13 at 08:59
  • 1
    It's overkill if you are sure to have only one tag, and you are sure the string you'll add is not already inside the tag. With my solution you can check whether "http://www.w3schools.com/images" etc. is already inside before doing the replace. Of course this is silly in the example, but it can be different in a real case – Draykos May 13 '13 at 09:01
  • As one of the answers in the thread Jodrell linked says: "I like to parse HTML with regular expressions. I don't attempt to parse idiot HTML that is deliberately broken." – Draykos May 13 '13 at 09:52
1

Just replace the HEAD's closing tag; in HTML there should only be one:

"<head></head>".Replace( "</head>" , "<a href=\"http://www.w3fools.com\">W3Fools</a>" + "</head>" );

You can flip this around and replace the HEAD's opening tag instead, to insert a tag at the beginning.

If you need anything more complex, then you should look into parsing the HTML properly.

Mesh