
Reading this question, it seems Regex is the solution to my problem.

This is the HTML I'm trying to split:

\n\t\t\t
    <td class=\"stats_name\">
        Damage \n\t\t\t

    <td class=\"stats_value\">
        53 \n\t\t\t

    <td class=\"stats_modifier\">
        (<span class=\"ability_per_level_stat\">+3.2 / per level</span>) \n\t\t\n\t\t  

    </td>

    </td>

    </td>

For my own reasons, I need to split this on the <td string. This worked well enough with HtmlAgilityPack and String.Split; however, String.Split removes the delimiter and I need it kept in each piece.

var statCells = rowDocument.DocumentNode.InnerHtml.Split(new string[] {"<td"}, StringSplitOptions.RemoveEmptyEntries).ToList();

And here's the same "function" using Regex to keep the delimiter. However, it doesn't work as expected and returns far too many strings; I think it's splitting on "<", "t" and "d" individually.

var statCells = Regex.Split(rowDocument.DocumentNode.InnerHtml, @"(?<=[<td])").ToList();

How can I use Regex.Split to split on "<td"?

Only Bolivian Here
  • What do you mean by splitting on td? How does this not work with the HtmlAgilityPack? If you do doc.DocumentElement.SelectNodes("td"), you will get each td node, including its tag name – Polity Dec 09 '11 at 02:24
  • @Polity: Try it! It doesn't work as you'd expect because these particular TD's don't have closing elements and the content is stretched to encompass everything until the end. :) – Only Bolivian Here Dec 09 '11 at 02:25
  • Got it! so you're not trying to split html after all ;) – Polity Dec 09 '11 at 02:28
  • @Polity: Not any *properly* formatted HTML, no. :P – Only Bolivian Here Dec 09 '11 at 02:29

1 Answer


@"(?<=[<td])" splits after every <, t or d, because [<td] is a character class: it matches any single one of those characters, not the literal sequence <td. Use this if you want the <td at the beginning of the next string (rather than the end of the last one):

@"(?=<td)"
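To make the difference concrete, here is a minimal sketch of the lookahead split. The class and method names (TdRegexSplit, SplitOnTd) and the sample string are made up for illustration; in the question the input would be rowDocument.DocumentNode.InnerHtml. The whitespace filter is an assumption on my part, added because the text before the first <td comes back as its own piece.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TdRegexSplit
{
    // Hypothetical helper: splits in front of every literal "<td".
    public static List<string> SplitOnTd(string html)
    {
        // (?=<td) is a zero-width lookahead, so nothing is consumed
        // and each piece keeps its leading "<td".
        return Regex.Split(html, @"(?=<td)")
            // Drop pieces that are empty or only whitespace, such as
            // the text before the first "<td" in the sample below.
            .Where(piece => piece.Trim().Length > 0)
            .ToList();
    }

    public static void Main()
    {
        // Made-up stand-in for rowDocument.DocumentNode.InnerHtml.
        string html = "\n\t<td class=\"stats_name\">Damage\n\t<td class=\"stats_value\">53\n\t</td></td>";

        foreach (string cell in SplitOnTd(html))
        {
            Console.WriteLine("[" + cell.Trim() + "]");
        }
    }
}
```

Note that "</td" is not a split point, since the lookahead needs "<" to be followed directly by "td".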

The regex approach is going to be slower than the original solution, though. If you use String.Split and then just prepend <td to each resulting string, that should give the same result but faster, because you don't use regexen.
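A rough sketch of the String.Split variant, again with made-up names (TdStringSplit, SplitAndRestore) and a made-up sample string. One deliberate change from the code in the question: it uses StringSplitOptions.None instead of RemoveEmptyEntries, so the first element is always whatever came before the first <td and can be skipped safely. That is my assumption about what you want; keep the first element if you need that leading text.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TdStringSplit
{
    // Hypothetical helper: split on "<td", then put the delimiter back.
    public static List<string> SplitAndRestore(string html)
    {
        const string delimiter = "<td";

        // With StringSplitOptions.None, pieces[0] is always the text
        // that preceded the first "<td" (possibly an empty string).
        string[] pieces = html.Split(new[] { delimiter }, StringSplitOptions.None);

        return pieces
            .Skip(1)                            // that piece never had a "<td" in front of it
            .Select(piece => delimiter + piece) // restore the removed delimiter
            .ToList();
    }

    public static void Main()
    {
        // Made-up stand-in for rowDocument.DocumentNode.InnerHtml.
        string html = "\n\t<td class=\"stats_name\">Damage\n\t<td class=\"stats_value\">53\n\t</td></td>";

        foreach (string cell in SplitAndRestore(html))
        {
            Console.WriteLine("[" + cell.Trim() + "]");
        }
    }
}
```

If the input contains no <td at all, this returns an empty list rather than the whole string.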

Dan