I am trying to manipulate an HTML file that I'm downloading using WebClient. I am using Regex to extract the values of the href attribute. Here is my code:

string html = "";
WebClient webClient = new WebClient();

html = webClient.DownloadString(addressTextBox.Text).Replace("\n", "").Replace("\t", "");
address = webClient.BaseAddress;

StringBuilder stringBuilder = new StringBuilder(html);
MatchCollection matchCollection = Regex.Matches(html, @"(?<=\bhref="")[^""]*");

int offset = 0;

foreach (Match match in matchCollection)
{
    string newValue = addressTextBox.Text + match.Value.Replace("./", "").Replace("../", "");
    int tempOffset = match.Index - offset;

    stringBuilder.Remove(tempOffset, match.Length);
    stringBuilder.Insert(tempOffset, newValue);
    offset = newValue.Length - match.Length;
}

webBrowser.DocumentText = stringBuilder.ToString();
File.WriteAllText(@"C:\Users\Admin\Documents\site.xml", stringBuilder.ToString(), Encoding.UTF8);

Here is what I'm trying to do:

  1. I get the index at which each href attribute's value starts
  2. I remove the attribute's old value
  3. I insert a new value in its place
  4. Since the new value is generally longer than the old one, I keep an offset variable that stores the difference between the new value's length and the old value's length, and I subtract that offset from the next match's index
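
For comparison, the four steps above can be sketched with the offset handled differently. Every replacement that lengthens the text pushes all later matches to the right, so the offset has to accumulate across iterations and be added to the match index rather than subtracted. This is only a minimal sketch: the class name `HrefRewriter` and the method `PrefixHrefs` are hypothetical, and stripping `./` and `../` is deliberately left out.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class HrefRewriter
{
    // Hypothetical helper: prefixes every double-quoted href value with baseAddress.
    public static string PrefixHrefs(string html, string baseAddress)
    {
        var builder = new StringBuilder(html);
        MatchCollection matches = Regex.Matches(html, @"(?<=\bhref="")[^""]*");

        // Total number of characters added so far by earlier replacements.
        int offset = 0;

        foreach (Match match in matches)
        {
            string newValue = baseAddress + match.Value;

            // Match indexes refer to the original string, so shift them
            // right by everything that has been inserted before this point.
            int position = match.Index + offset;

            builder.Remove(position, match.Length);
            builder.Insert(position, newValue);

            // Accumulate the length difference instead of overwriting it.
            offset += newValue.Length - match.Length;
        }

        return builder.ToString();
    }
}
```

A separate detail worth checking: chaining `Replace("./", "")` before `Replace("../", "")` turns `"../x"` into `".x"`, because the first call already consumes the `./` inside `../`.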

Below is a screenshot of the damage that occurs after I try to manipulate the web page:

Screenshot

What am I doing wrong? How do I correctly replace the values of each href attribute?

Razor
  • Trying to manipulate HTML via regexes is about the most difficult and error-prone way possible. Use HTML Agility Pack. – Dour High Arch Dec 05 '17 at 00:35
  • Here is a very good, highly up-voted explanation of what you are doing wrong: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/... More info on HAP: https://stackoverflow.com/questions/846994/how-to-use-html-agility-pack. There is also an explanation of how to do exactly what you asked, which from my point of view makes this a reasonable duplicate: https://stackoverflow.com/questions/12912632/need-to-replace-href-of-anchor-tags-in-a-string – Alexei Levenkov Dec 05 '17 at 00:51
  • Thank you for your responses. I will try to use HTML Agility Pack. – Razor Dec 05 '17 at 02:34
  • Thanks Alexei for sharing those posts. They were really helpful. – Razor Dec 05 '17 at 02:39
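
As a rough illustration of the HTML Agility Pack route suggested in the comments, the sketch below rewrites every `href` by resolving it against a base address with `Uri.TryCreate`, which also takes care of `./` and `../` segments. It assumes the `HtmlAgilityPack` NuGet package is referenced; the class `HrefResolver` and the method `MakeHrefsAbsolute` are hypothetical names, and this is one possible approach rather than the method from the linked posts.

```csharp
using System;
using HtmlAgilityPack; // assumption: NuGet package "HtmlAgilityPack" is installed

public static class HrefResolver
{
    // Hypothetical helper: makes every href in html absolute relative to baseUri.
    public static string MakeHrefsAbsolute(string html, Uri baseUri)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//*[@href]");
        if (nodes == null)
        {
            return html;
        }

        foreach (HtmlNode node in nodes)
        {
            string href = node.GetAttributeValue("href", "");

            // Resolve the value against the base address; values that
            // cannot be combined into a valid URI are left untouched.
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
            {
                node.SetAttributeValue("href", absolute.AbsoluteUri);
            }
        }

        return document.DocumentNode.OuterHtml;
    }
}
```

Called as `HrefResolver.MakeHrefsAbsolute(html, new Uri(addressTextBox.Text))`, this needs no index or offset bookkeeping, because the parser tracks where each attribute lives.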

0 Answers