
I am working on manipulating and extracting data from well-formed HTML in one of our legacy systems. I need to use regex to parse the HTML, find certain patterns, extract the data, and return some modified HTML. I know that regex and HTML are never the answer but, given that I know exactly where the data is coming from and that the data is properly structured, I am confident that this will work for this particular situation.

The HTML that I am working with has the following pattern:

<i>Name1</i>: Some text goes here<br/>
<i>Name2</i>: Some different text goes here<br/>
<i>Name3</i>: Some other different text goes here<br/>

I need to change the HTML to the following:

<i>Name1</i><p>Some text goes here</p>
<i>Name2</i><p>Some different text goes here</p>
<i>Name3</i><p>Some other different text goes here</p>

Basically, I want to take the inner text, wrap it in a p tag and then remove the trailing br.

I want to do something like the following:

Dim HTML as String = [The HTML goes here]
html = Regex.Replace(html, "</i>:(.+?)<br\s*\/?>", "</i><p>(.+?)</p>", RegexOptions.Multiline)

but it obviously isn't working.

In VB.net, how do I replace all desired instances of HTML with the new HTML?

Try [Html Agility Pack](http://htmlagilitypack.codeplex.com/). I have never tried it myself, but it has been suggested many times here, so it must be good enough. – Victor Zakharov Nov 26 '12 at 16:42

2 Answers


I suggest using the HTML Agility Pack to parse and manipulate HTML (in particular if the format of the HTML is not regular). The source download comes with a bunch of example projects, so you can see how to use it.

In general, regex is not a good solution for parsing HTML.

Oded

Give this a shot:

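Imports System.Text.RegularExpressions

Dim html As String = [The HTML goes here]
' Groups(1) holds the text between "</i>:" and the trailing <br/>;
' the \s* after the colon keeps the leading space out of the <p>.
Dim evaluator As MatchEvaluator = Function(m As Match)
                                      Return "</i><p>" & m.Groups(1).Value & "</p>"
                                  End Function
html = Regex.Replace(html, "</i>:\s*(.+?)<br\s*/?>", evaluator, RegexOptions.Multiline)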
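For a plain substitution like this one, a replacement string with a `$1` group reference may be enough, since `Regex.Replace` expands `$1` to the text captured by the first group. The sketch below is a minimal console program under a few assumptions: the module name `ReplaceDemo` is made up for illustration, the sample input is copied from the question, and the `\s*` after the colon is added on the assumption that the space before the text should not end up inside the `<p>`.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
    Sub Main()
        ' Sample input copied from the question (two of the three lines).
        Dim html As String = "<i>Name1</i>: Some text goes here<br/>" & vbLf &
                             "<i>Name2</i>: Some different text goes here<br/>"

        ' "$1" in the replacement string refers to capture group 1,
        ' so no MatchEvaluator is needed for a plain substitution.
        ' "." does not match a line feed by default, so each match
        ' stays within its own line.
        Dim result As String = Regex.Replace(html, "</i>:\s*(.+?)<br\s*/?>", "</i><p>$1</p>")

        Console.WriteLine(result)
    End Sub
End Module
```

With the sample input, each line should come out in the form `<i>Name1</i><p>Some text goes here</p>`. A `MatchEvaluator` is still the better fit when the replacement needs logic that a group reference cannot express, such as trimming or encoding the captured text.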
NakedBrunch