-2

I am looking to capture text from HTML into named groups: Header, Name and Val.

The HTML can vary, but this is how it would typically look:

<div>
   <h5>Header 1</h5>
      <strong>Name1</strong>
          &nbsp;
          Value 1 <br>
      <strong>Name2</strong>
          &nbsp;
          Value 2 <br>
   <div>
   <h5>Header 2</h5>
      <strong>Name1</strong>
          &nbsp;
          Value 1 <br>
          Value 1 continued
      <strong>Name2</strong>
          &nbsp;
          Value 2 <br>
   <h5>Header 3</h5>
      <strong>Name1</strong>
          &nbsp;
          Value 1 <br>
          Value 1 continued
      <strong>Name2</strong>
          &nbsp;
          Value 2 <br>
   <br>
   </div>
</div>

This is what I started with, but it relies on nothing coming after the <br>:

string pattern = "((<h5>(?<Header>.*?)<\\/h5>)|(<strong>(?<Name>.*?)<\\/strong>)|(&nbsp;(?<Val>.*?)<br>))";
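
To make the limitation concrete, here is a minimal sketch. The sample string is a shortened, hypothetical version of the HTML above, and `RegexOptions.Singleline` is an assumption: without it `.` cannot cross the line break between `&nbsp;` and the value, so `Val` would not match this sample at all.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The pattern from the question.
string pattern = "((<h5>(?<Header>.*?)<\\/h5>)|(<strong>(?<Name>.*?)<\\/strong>)|(&nbsp;(?<Val>.*?)<br>))";

// Hypothetical, shortened sample in the shape of the HTML above.
string html =
    "<div>\n<h5>Header 1</h5>\n" +
    "<strong>Name1</strong>\n&nbsp;\nValue 1 <br>\nValue 1 continued\n" +
    "<strong>Name2</strong>\n&nbsp;\nValue 2 <br>\n</div>";

// Collect only the Val captures to see what the pattern picks up.
var vals = new List<string>();
foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
{
    if (m.Groups["Val"].Success)
        vals.Add(m.Groups["Val"].Value.Trim());
}

Console.WriteLine(string.Join(" | ", vals));
```

With this sample the output should be `Value 1 | Value 2`: the "Value 1 continued" text after the <br> is never captured.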
monsey11
  • There are [much](https://www.nuget.org/packages/AngleSharp/) [better](https://www.nuget.org/packages/HtmlAgilityPack/) [tools](https://www.nuget.org/packages/CsQuery/) available for this job; don't use regex. – Lucas Trzesniewski Aug 20 '15 at 16:53
  • [Obligatory link on Regex and HTML](http://stackoverflow.com/a/1732454/1958365). – Equalsk Aug 20 '15 at 16:55
  • I recommend reading http://stackoverflow.com/a/1732454/1945631 – Andy Brown Aug 20 '15 at 17:02
  • I'd stress this part: *The HTML can vary, but this is how it would typically look*. Are you interested in an HTML parser based solution? I hope you are, since with a parser you will be able to do two things safely: extract the text and convert entities to literals. – Wiktor Stribiżew Aug 20 '15 at 17:23
  • @LucasTrzesniewski CsQuery looks like a great tool for this. Can you show me how I would select the groups above, or point me to a link with more detailed examples? Thanks – monsey11 Aug 20 '15 at 17:24

2 Answers

-1

Remove the occurrences of the <br> tag from the input before matching, and voilà: str.Replace("<br>", ""), etc.

Dmitry Sadakov
-1

I changed the pattern to:

string pattern = "(((?<=<h5>)(?<Header>.*?)(?=<\\/h5>))|((?<=<strong>)(?<Name>.*?)(?=<\\/strong>))|((?<=<\\/strong>)(?<Val>.*?)((?=<h5>)|(?=<strong>)|(?=<\\/div>))))";

It seems to be working. If you have a cleaner, better answer, I will mark it as the correct one.
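
For anyone who wants to try it, here is a rough sketch of how the matches could be turned into (Header, Name, Val) rows. Several things here are assumptions rather than part of the pattern itself: the shortened sample string, the use of `RegexOptions.Singleline` so that `.` can span line breaks, and the cleanup step, which is needed because the raw `Val` capture still contains `&nbsp;`, `<br>` and surrounding whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The lookaround pattern from this answer.
string pattern = "(((?<=<h5>)(?<Header>.*?)(?=<\\/h5>))|((?<=<strong>)(?<Name>.*?)(?=<\\/strong>))|((?<=<\\/strong>)(?<Val>.*?)((?=<h5>)|(?=<strong>)|(?=<\\/div>))))";

// Hypothetical, shortened sample in the shape of the question's HTML.
string html =
    "<div>\n<h5>Header 1</h5>\n" +
    "<strong>Name1</strong>\n&nbsp;\nValue 1 <br>\nValue 1 continued\n" +
    "<strong>Name2</strong>\n&nbsp;\nValue 2 <br>\n</div>";

var rows = new List<(string Header, string Name, string Val)>();
string header = "", name = "";

// Assumption: Singleline lets '.' cross the line breaks inside a value.
foreach (Match m in Regex.Matches(html, pattern, RegexOptions.Singleline))
{
    if (m.Groups["Header"].Success)
    {
        header = m.Groups["Header"].Value.Trim();   // remember the current header
    }
    else if (m.Groups["Name"].Success)
    {
        name = m.Groups["Name"].Value.Trim();       // remember the current name
    }
    else if (m.Groups["Val"].Success)
    {
        // Assumed cleanup: drop &nbsp; and <br>, then collapse runs of whitespace.
        string val = m.Groups["Val"].Value.Replace("&nbsp;", " ").Replace("<br>", " ");
        val = Regex.Replace(val, @"\s+", " ").Trim();
        rows.Add((header, name, val));
    }
}

foreach (var row in rows)
    Console.WriteLine($"{row.Header} | {row.Name} | {row.Val}");
```

On this sample it should print `Header 1 | Name1 | Value 1 Value 1 continued` followed by `Header 1 | Name2 | Value 2`, so the text after the <br> is kept. It is still a regex over HTML, so markup that departs from this shape (attributes on the tags, nested elements inside a value) may need a different approach, such as the parsers suggested in the comments.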

monsey11