0

I have the following fragment of HTML:

<p>​<a href=\"/es-es/Documents/test.txt\"><img class=\"ms-asset-icon ms-rtePosition-4\" src=\"/_layouts/15/images/ictxt.gif\" alt=\"\" />test.txt</a><a href=\"/es-es/Documents/test%20-%20Copy.txt\"><img width=\"16\" height=\"16\" class=\"ms-asset-icon ms-rtePosition-4\" src=\"/_layouts/15/images/ictxt.gif\" alt=\"\" />test - Copy.txt</a><a href=\"/es-es/Documents/test%20-%20Copy%20(2).txt\"><img width=\"16\" height=\"16\" class=\"ms-asset-icon ms-rtePosition-4\" src=\"/_layouts/15/images/ictxt.gif\" alt=\"\" />test - Copy (2).txt</a></p>

This HTML is in a string. I need to strip out the hrefs from all the links and am not sure how to go about this.

NOTE: I left the string as is, which is why it's not formatted across multiple lines of code.

tshepang
Orlando

4 Answers

1

HtmlAgilityPack is the most commonly recommended tool for parsing and manipulating HTML.

Some starting code would look like the following (more samples are one search away):

var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(htmlString);
// SelectNodes returns every <a> element that has an href attribute (null if none match)
var aNodesWithHref = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
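As a slightly fuller sketch of the same idea, the selected nodes can be looped over to read each href and, if wanted, remove it. This assumes the HtmlAgilityPack NuGet package is installed; the shortened input string and the `Program` class are placeholders for illustration, not part of the original answer.

```csharp
using System;
using HtmlAgilityPack; // NuGet package, not part of the base class library

class Program
{
    static void Main()
    {
        // Shortened stand-in for the string in the question (assumption).
        string htmlString = "<p><a href=\"/es-es/Documents/test.txt\">test.txt</a></p>";

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(htmlString);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = htmlDoc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null)
        {
            foreach (HtmlNode a in anchors)
            {
                // Read the href value, falling back to "" if it is missing.
                Console.WriteLine(a.GetAttributeValue("href", ""));

                // Drop the attribute but keep the <a> element itself.
                a.Attributes.Remove("href");
            }
        }

        // The modified markup, with the href attributes gone.
        Console.WriteLine(htmlDoc.DocumentNode.OuterHtml);
    }
}
```

Unlike an XML parser, HtmlAgilityPack tolerates markup that is not well-formed, which is why it tends to be suggested for arbitrary HTML.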
Alexei Levenkov
1

Try this. You could easily achieve the expected result using XML manipulation.

string s = "<p>​<a href=\"/es-es/Documents/test.txt\"><img class=\"ms-asset-icon ms-rtePosition-4\" src=\"/_layouts/15/images/ictxt.gif\" alt=\"\" />test.txt</a><a href=\"/es-es/Documents/test%20-%20Copy.txt\"><img width=\"16\" height=\"16\" class=\"ms-asset-icon ms-rtePosition-4\" src=\"/_layouts/15/images/ictxt.gif\" alt=\"\" />test - Copy.txt</a><a href=\"/es-es/Documents/test%20-%20Copy%20(2).txt\"><img width=\"16\" height=\"16\" class=\"ms-asset-icon ms-rtePosition-4\" src=\"/_layouts/15/images/ictxt.gif\" alt=\"\" />test - Copy (2).txt</a></p>";
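// Requires: using System; using System.Xml.Linq;
var xdoc = XDocument.Parse(s);
// Remove the href attribute from every <a> element
xdoc.Descendants("a")
    .Attributes("href")
    .Remove();
Console.WriteLine(xdoc.ToString());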
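If the goal is to read the href values rather than remove them, the same LINQ to XML approach might be sketched as below. The shortened input string and the `Program` class are placeholders for illustration, and the input still has to be well-formed XML for `XDocument.Parse` to accept it.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

class Program
{
    static void Main()
    {
        // Shortened stand-in for the string in the question (assumption).
        string s = "<p><a href=\"/es-es/Documents/test.txt\">test.txt</a>"
                 + "<a href=\"/es-es/Documents/test%20-%20Copy.txt\">test - Copy.txt</a></p>";

        // Collect the value of every href attribute found on an <a> element.
        var hrefs = XDocument.Parse(s)
            .Descendants("a")
            .Attributes("href")
            .Select(attr => attr.Value)
            .ToList();

        foreach (string href in hrefs)
            Console.WriteLine(href);
        // Prints:
        // /es-es/Documents/test.txt
        // /es-es/Documents/test%20-%20Copy.txt
    }
}
```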
Sriram Sakthivel
  • Since I wanted to get the hrefs not remove them I used this portion of the code: var xdoc = XDocument.Parse(s).Descendants("a").Attributes("href"); – Orlando Oct 24 '13 at 19:33
  • Since the `img` tags are XHTML, this should be fine. If your input can't be guaranteed to be an XHTML fragment, this is not a generalizable answer, as HTML `img` tags aren't self-closed (nor are many others). – Tetsujin no Oni Oct 25 '13 at 17:39
  • @TetsujinnoOni Of course, but I concentrated on solving the OP's specific problem rather than giving a generalized answer you can use anywhere, though it does work for well-formed XML. – Sriram Sakthivel Oct 25 '13 at 18:33
  • Of course; I'm mainly concerned in the comments here with those who find the question in the future. – Tetsujin no Oni Oct 25 '13 at 19:16
0

You could use the AttributeCollection.Remove method:

YourLink.Attributes.Remove("href");
Kristian
  • That'd be great if it wasn't in a raw string. – Tetsujin no Oni Oct 24 '13 at 18:44
  • Yeah, but shouldn't he be required to extract the link and store it in a variable before doing any subsequent steps? I mean, the alternative is to start regexing strings, and we all know how elegant that is... – Kristian Oct 24 '13 at 18:45
0

Can you just replace it with a Regex?

string newString = Regex.Replace(oldString, @"<a href[^>]+>", @"");
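Note that the pattern above matches the whole opening `<a ...>` tag, so the closing `</a>` tags stay behind in the result. A minimal sketch that only touches the attribute is shown below; it assumes hrefs are always double-quoted and that attribute values contain no `>`, which holds for the string in the question but not for HTML in general. The shortened input and the `Program` class are placeholders for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Shortened stand-in for the string in the question (assumption).
        string oldString = "<p><a href=\"/es-es/Documents/test.txt\">test.txt</a></p>";

        // Extract: group 1 captures the value between the double quotes.
        foreach (Match m in Regex.Matches(oldString, "<a\\s[^>]*?href=\"([^\"]*)\""))
            Console.WriteLine(m.Groups[1].Value);
        // Prints: /es-es/Documents/test.txt

        // Remove: keep everything before the attribute (group 1), drop the attribute.
        string newString = Regex.Replace(
            oldString,
            "(<a\\b[^>]*?)\\s+href=\"[^\"]*\"",
            "$1");
        Console.WriteLine(newString);
        // Prints: <p><a>test.txt</a></p>
    }
}
```

For anything beyond a known, fixed fragment, a real parser is the safer choice, since regular expressions break easily on single-quoted or unquoted attributes.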
Jonesopolis