I am trying to manipulate an HTML file that I'm downloading using `WebClient`. I am using `Regex` to extract the values of the `href` attribute. Here is my code:

    string html = "";
    WebClient webClient = new WebClient();
    html = webClient.DownloadString(addressTextBox.Text).Replace("\n", "").Replace("\t", "");
    address = webClient.BaseAddress;
    StringBuilder stringBuilder = new StringBuilder(html);
    MatchCollection matchCollection = Regex.Matches(html, @"(?<=\bhref="")[^""]*");
    int offset = 0;
    foreach (Match match in matchCollection)
    {
        string newValue = addressTextBox.Text + match.Value.Replace("./", "").Replace("../", "");
        int tempOffset = match.Index - offset;
        stringBuilder.Remove(tempOffset, match.Length);
        stringBuilder.Insert(tempOffset, newValue);
        offset = newValue.Length - match.Length;
    }
    webBrowser.DocumentText = stringBuilder.ToString();
    File.WriteAllText(@"C:\Users\Admin\Documents\site.xml", stringBuilder.ToString(), Encoding.UTF8);

Here is what I'm trying to do:
- I am trying to get the index of where the value of an `href` attribute is
- I am trying to remove the attribute's value
- I am inserting a new value to replace the old one
- Since the new attribute value is generally larger than the old one, I have created an `offset` variable to store the difference between the previous attribute value's length and the new one. Then, I am subtracting the offset from the next match's index
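
For comparison, the bookkeeping in the steps above can be isolated on a toy input. This is only a minimal sketch under stated assumptions: `baseUrl` and the hard-coded `html` string are hypothetical stand-ins for `addressTextBox.Text` and the downloaded page, and the `./` and `../` handling is left out. Unlike the code above, it accumulates the length difference across all earlier matches and adds it to each match's index, because `match.Index` refers to positions in the original, unmodified string.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

class HrefRewriteSketch
{
    static void Main()
    {
        // Hypothetical inputs standing in for addressTextBox.Text and the downloaded page.
        string baseUrl = "http://example.com/";
        string html = "<a href=\"a.html\">A</a><a href=\"b.html\">B</a>";

        var stringBuilder = new StringBuilder(html);
        MatchCollection matches = Regex.Matches(html, @"(?<=\bhref="")[^""]*");

        // Running total of how far earlier replacements have shifted the text.
        int offset = 0;
        foreach (Match match in matches)
        {
            string newValue = baseUrl + match.Value;

            // match.Index is relative to the original string, so add the accumulated shift.
            int position = match.Index + offset;
            stringBuilder.Remove(position, match.Length);
            stringBuilder.Insert(position, newValue);

            // Accumulate the growth rather than overwriting the previous value.
            offset += newValue.Length - match.Length;
        }

        // Prints: <a href="http://example.com/a.html">A</a><a href="http://example.com/b.html">B</a>
        Console.WriteLine(stringBuilder.ToString());
    }
}
```

The same substitution could probably be done without any manual index arithmetic by passing a `MatchEvaluator` to `Regex.Replace`, for example `Regex.Replace(html, pattern, m => baseUrl + m.Value)`, where `pattern` is the same expression as above.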
Below is a screenshot of the damage that occurs after I try to manipulate the web page:
What am I doing wrong? How do I correctly replace the values of each `href` attribute?