I need to parse HTML that is formatted like the code sample below. The issue is that the field name can be wrapped in tags that have variable background or color styles. The pattern I am looking for is a `<br />` tag, ignoring any span that wraps the text, followed by a colon (for example, `<br />id:` with no span tag wrapping). Matching this pattern should give me the key name; whatever follows the key name is the key value, until the next key name is hit. Below is a sample of the HTML I need to parse.

string source = @"
<br />id: Value here
<br /><SPAN style=""background-color: #A0FFFF; color: #000000"">community</SPAN>: Value here
<br /><SPAN style=""background-color: #A0FFFF; color: #000000"">content</SPAN><SPAN style=""background-color: #A0FFFF; color: #000000"">title</SPAN>: Value here
";
// Split the source into key/value pairs based on the pattern match.

Thanks for any help.

lance-p
  • Take a look here: http://stackoverflow.com/a/1732454/3227403 – pid Aug 23 '14 at 13:24
  • @pid, he's just trying to parse a well defined structure where the delimiters happen to be shaped like HTML elements, so I don't think we need to worry about accidentally summoning Cthulhu. In other words: http://stackoverflow.com/a/1733489/2611587 – Steve Ruble Aug 23 '14 at 13:37
  • @SteveRuble in fact mine was not an answer but a comment :) – pid Aug 23 '14 at 13:50

1 Answer

Here's some code that will parse it, assuming that your example HTML should have another `<br />` element after `content`.

using System.Linq;
using System.Text.RegularExpressions;

string source = @"
  <br />id: Value here
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">community</SPAN>: Value here
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">content</SPAN>
  <br /><SPAN style=""background-color: #A0FFFF; color: #000000"">title</SPAN>: Value here";

// Group 1 is the key (optionally wrapped in a SPAN); group 2 is the value text that follows it.
var items = Regex.Matches(source, @"<br />(?:<SPAN[^>]*>)?([^<:]+)(?:</SPAN>)?:?\s?(.*)")
         .OfType<Match>()
         .ToDictionary(m => m.Groups[1].Value, m => m.Groups[2].Value)
         .ToList();
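
A caveat worth checking: in .NET, `\s?` can consume a bare `\n`, and `.` matches `\r`, so the pattern above behaves differently depending on the line endings in `source`. With LF-only endings, the empty `content` entry would swallow the following `title` line as its value; with CRLF endings, values keep a trailing `\r`. The sketch below is one possible line-ending-independent variant, not part of the original answer. The `KeyValueParser` class and its `Parse` method are hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (assumed name), sketched as a variant of the regex above.
public static class KeyValueParser
{
    // Same key pattern as the answer, but the value is limited to the current
    // line: [ \t]* skips spaces/tabs only, and [^\r\n]* stops at any line break.
    private static readonly Regex Entry = new Regex(
        @"<br />(?:<SPAN[^>]*>)?([^<:]+)(?:</SPAN>)?:?[ \t]*([^\r\n]*)");

    // Returns the pairs in document order. A list is used instead of a
    // dictionary so that repeated key names do not throw.
    public static List<KeyValuePair<string, string>> Parse(string source)
    {
        return Entry.Matches(source)
            .OfType<Match>()
            .Select(m => new KeyValuePair<string, string>(
                m.Groups[1].Value.Trim(),   // key name
                m.Groups[2].Value.Trim()))  // value text, may be empty
            .ToList();
    }
}
```

Calling `KeyValueParser.Parse(source)` on the four-line sample should yield `id`, `community`, `content`, and `title` in order, with an empty value for `content`. Returning a list rather than going through `ToDictionary` also avoids the `ArgumentException` that `ToDictionary` throws when the same key appears twice.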
Steve Ruble
  • Thanks. This doesn't look like it is parsing all the key value pairs as I expect. I was hoping this would be generic enough to have it return the key/val pairs based on the pattern. In my example string it would be parsed as: items[0].Key = "id" items[0].Value = "Value here" items[1].Key = "community" items[1].Value = "Value here" items[2].Key = "content" items[2].Value = "" items[3].Key = "title" items[3].Value = "Value here" – lance-p Aug 25 '14 at 21:13
  • @user971823, a `Dictionary` is a list of key/val pairs. I've added a `ToList()` call so that the value of `items` will conform to the example result in your comment. – Steve Ruble Aug 26 '14 at 10:37
  • Thanks again. Is there a way to make this more generic so that the key name is the value based on the pattern (without having to specify the key names)? For example: instead of .ToDictionary (m => m.Groups[Use Pattern to Derive Key Name and Values].Value. – lance-p Aug 26 '14 at 15:06
  • @user971823, I'm not sure if I understand your question. I've updated my answer code to remove the named captures, if that's what you were worried about. – Steve Ruble Aug 26 '14 at 15:12
  • I needed to copy your entire code block with the string source. That's exactly what I needed! Thanks Steve! – lance-p Aug 26 '14 at 15:21