1

I have some HTML content that I need to modify using C#. It is conceptually simple, but I'm not sure how to do it efficiently. The content contains several occurrences of delimited numbers, each followed by an empty anchor tag. I need to take the delimited number and insert it into a JavaScript function call in the anchor tag. For example:

The source string would contain something like this:

%%1%%<a href="#"></a> 
<p>A bunch of HTML markup</p>

%%2%%<a href="#"></a>
<p>Some more HTML markup</p>

I need to transform it to this:

<a href="#" onclick="DoSomething('1')"></a>
<p>A bunch of HTML markup</p>

<a href="#" onclick="DoSomething('2')"></a>
<p>Some more HTML markup</p>

There is no limit to the number of %%\d+%% occurrences. I took a crack at writing a regular expression in hopes I could use the Replace method, but I'm not sure if that can even work with multiple instances of each group. Here's what I had:

%%(?<LinkID>\d+)%%(?<LinkStart><a[\s\S]*?)(?:(?<LinkEnd>>[\s\S]*?)(?=%%\d+|$))

// %%(?<LinkID>\d+)%%        Match a number surrounded by %% and put the number in a group named LinkID
// (?<LinkStart><a[\s\S]*?)  Match <a followed by any characters until next match (non greedy), in a group named LinkStart
// (?:                       Logical grouping that does not get captured
// (?<LinkEnd>>[\s\S]*?)     Match > followed by any characters until next match, in a group named LinkEnd
// (?=%%\d+%%|$)             Where the former LinkEnd group is followed by another instance of a delimited number or the end of the string. (I don't think this is working as I intended.)

Maybe some combination of a couple Regex operations and String.Format could be used. I'm not an expert at regular expressions.

halfer
xr280xr

4 Answers

1

Using regex to parse HTML has been covered extensively on SO. The consensus is that it should not be done.

If you need to parse your HTML, I would recommend using something like the HTML Agility Pack. It lets you use XPath-style queries to identify which HTML you want to work on.

Abe Miessler
  • 1
    I'm getting the sense from the upvotes that some are not reading the question. I am not parsing HTML. The fact that the string contains HTML is trivial. I can edit the OP and remove all traces of HTML and end up with the same question. Are you trying to say a Regex shouldn't be used to parse a string? That may be. I included the Regex because that's the route I had tried, but I am willing to accept it may not be the right approach. That being said, I'm still looking for a way to solve this problem, not just what not to do. I'm not dealing with a valid XML string, so XML parsers won't work. – xr280xr Jun 05 '12 at 19:28
  • I removed "Regex" from the title so as to not stifle creativity. Thanks for your input. – xr280xr Jun 05 '12 at 19:30
  • Am I trying to say regex shouldn't be used to parse a string? No, that's what it is for. Just to be clear, I didn't just tell you what not to do, I also pointed you towards the HTML Agility Pack, which is intended to help you parse HTML (even malformed HTML) and NOT XML as you seem to be suggesting. If that doesn't work for you, then I would recommend the standard string methods. Either way, regex is one of the last ways I would attempt to solve this problem. – Abe Miessler Jun 05 '12 at 20:18
  • True. I appreciated your suggestion, but in reading the linked question, it sounded as if it were based off a wrong assumption and when I looked at the Agility Pack, nothing jumped out at me as something that would get me any farther towards solving this issue. I always think of HTML as a form of XML markup...pardon the mistake. – xr280xr Jun 05 '12 at 22:04
1

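I would say your regex is pretty much what you want; I've changed it slightly. This would work if $ matches only at the end of the string (that is, RegexOptions.Multiline is off). If the markup spans several lines, . also has to match newlines, which needs RegexOptions.Singleline: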

%%(\d+)%%(<a[^>]*)(></a>)(.*?)(?=%%\d|$)

If you decide to use this, then for each match you have access to the groups, so you can construct the new string from them. That will probably be easier than replacing parts of the existing string.
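As a rough sketch of building the new string from the groups, the snippet below runs the pattern above over sample input modelled on the question. The sample text and variable names are assumptions for illustration, and RegexOptions.Singleline is added so that . can cross line breaks. Note that anything before the first %% marker is not part of any match and would be dropped by this approach.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Sample input modelled on the question (illustrative only).
string source =
    "%%1%%<a href=\"#\"></a>\n<p>A bunch of HTML markup</p>\n\n" +
    "%%2%%<a href=\"#\"></a>\n<p>Some more HTML markup</p>";

// Singleline lets '.' match newlines; Multiline is left off so '$' is the end of the input.
var pattern = new Regex(@"%%(\d+)%%(<a[^>]*)(></a>)(.*?)(?=%%\d|$)", RegexOptions.Singleline);

var rebuilt = new StringBuilder();
foreach (Match m in pattern.Matches(source))
{
    rebuilt.Append(m.Groups[2].Value)            // opening of the anchor, e.g. <a href="#"
           .Append(" onclick=\"DoSomething('")
           .Append(m.Groups[1].Value)            // the number between the %% delimiters
           .Append("')\"")
           .Append(m.Groups[3].Value)            // ></a>
           .Append(m.Groups[4].Value);           // markup up to the next marker or the end
}

string result = rebuilt.ToString();
Console.WriteLine(result);
```
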

Joanna Derks
0

I would use string.Split for this one.

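// Requires: using System; using System.Text;
// GetData() and GetDataFromID() are placeholders for your own code.
string emptyAnchor = "<a href=\"#\"></a>";
string src = GetData();
string[] splits = src.Split(new string[] { "%%" }, StringSplitOptions.None);
StringBuilder sb = new StringBuilder();

// The first entry is whatever precedes the first %% (blank in the sample input)
sb.Append(splits[0]);

// After that the entries alternate: ID, content, ID, content, ...
for (int i = 1; i + 1 < splits.Length; i += 2)
{
    string id = splits[i];
    // Replace swaps every empty anchor in the segment;
    // perhaps use a StringReplaceFirstOccurrence function instead
    sb.Append(splits[i + 1].Replace(emptyAnchor, GetDataFromID(id)));
}
string output = sb.ToString();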
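
The comment in the loop mentions a StringReplaceFirstOccurrence function, which is not a built-in .NET method. One possible shape for such a helper is sketched below; the name comes from that comment, while the signature and the sample segment are assumptions.

```csharp
using System;

// Hypothetical helper: replaces only the first occurrence of 'search' in 'text'.
static string StringReplaceFirstOccurrence(string text, string search, string replacement)
{
    int index = text.IndexOf(search, StringComparison.Ordinal);
    if (index < 0)
    {
        return text; // nothing to replace
    }
    return text.Substring(0, index) + replacement + text.Substring(index + search.Length);
}

// Illustrative segment containing two empty anchors; only the first one is changed.
string segment = "<a href=\"#\"></a><p>First</p><a href=\"#\"></a>";
string updated = StringReplaceFirstOccurrence(
    segment,
    "<a href=\"#\"></a>",
    "<a href=\"#\" onclick=\"DoSomething('1')\"></a>");
Console.WriteLine(updated);
```
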
Biff MaGriff
  • I think this is on the right track for using string methods. It got me thinking that maybe I could just split it on the delimiters and use Regex.Replace from there, since there would only be one occurrence to modify in each set. But before I tried that, I decided to just try Regex.Replace with the basic expression to see if it could handle it, and it does. So you gave me some inspiration and an alternative. Thanks! – xr280xr Jun 05 '12 at 21:57
0

It turns out Regex.Replace already handles multiple matches. I just modified my regex to drop the lookahead. The idea is to capture the number inside the %% delimiters in one group, capture the content of the next anchor tag in a second group, and then replace the whole match with a new version that has the text from the two groups inserted into it. Replace applies this to every match in the string without any additional help.

// Requires: using System.Diagnostics; using System.Text.RegularExpressions;
string originalText = "<h3>%%1%%<a href=\"#\">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>" +
                            "<h3>%%2%%<a href=\"#\">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p>" +
                            "<p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>";

Regex regex = new Regex(@"%%(\d+)%%[\s]*<a[\s\S]*?>([\s\S]*?)</a>");
string result = regex.Replace(originalText, "<a href=\"#\" onclick=\"DoSomething($1)\">$2</a>");
Debug.WriteLine("Original Text: \"" + originalText + "\"");
Debug.WriteLine("Result Text: \"" + result + "\"");

Output:

Original Text: "<h3>%%1%%<a href="#">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><h3>%%2%%<a href="#">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p><p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>"

Result Text: "<h3><a href="#" onclick="DoSomething(1)">First Spot</a></h3><p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p><h3><a href="#" onclick="DoSomething(2)">Second Spot</a></h3><p>Ut vulputate lobortis feugiat.</p><p>Ut nunc diam, malesuada iaculis viverra nec, auctor eget velit.</p>"
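
The result above passes the ID unquoted, as DoSomething(1), while the question asked for DoSomething('1'). A possible variation, sketched here with shortened sample text, quotes the ID in the substitution string. It also shows the Regex.Replace overload that takes a MatchEvaluator, in case the replacement has to be computed per match. The variable names are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Shortened, illustrative input in the same shape as the example above.
string text = "<h3>%%1%%<a href=\"#\">First Spot</a></h3><h3>%%2%%<a href=\"#\">Second Spot</a></h3>";
var regex = new Regex(@"%%(\d+)%%\s*<a[\s\S]*?>([\s\S]*?)</a>");

// Variant 1: substitution string, with the captured ID wrapped in single quotes.
string quoted = regex.Replace(text, "<a href=\"#\" onclick=\"DoSomething('$1')\">$2</a>");

// Variant 2: a MatchEvaluator lambda builds each replacement with String.Format.
string viaEvaluator = regex.Replace(text, m =>
    string.Format("<a href=\"#\" onclick=\"DoSomething('{0}')\">{1}</a>",
                  m.Groups[1].Value, m.Groups[2].Value));

Console.WriteLine(quoted);
Console.WriteLine(viaEvaluator);
```

Both variants rewrite every match in a single call. The evaluator form is only worth the extra code when the replacement depends on something other than the captured text.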
xr280xr