I'm not very good with Regex and am still learning how to use it, so I would really appreciate some help.

I have a string like this:

"Language<option value='32'>Bahasa Indonesia<option value='19'>Dansk<option value='4'>Deutsch<option value='1'>English<option value='2'></option></option></option></option>"

I need to convert this into something I can work with, like:

public class CVModel
{
    public string value;
    public string content;
}

How do I use Regex to extract this information? I know I can use

"<.*?>"

to strip out the option tags and replace them with a delimiter that I can use to split the string into a list. But how do I extract the "value" attribute? Thanks in anticipation!

Blorgbeard
  • 6
    I strongly suggest you use [HtmlAgilityPack](http://htmlagilitypack.codeplex.com/) for something like this, instead of RegEx. – Tim Dec 01 '15 at 19:44
  • 1
    Where is this string coming from? It seems slightly strange, that all these language options are nested... If you can influence the generation of this string I'd rather try something like `` or even better ``. This is easy to read with simple XML-classes (e.g. 'XmlDocument') – Shnugo Dec 01 '15 at 19:51
  • Any comment on why all the options are nested like that? I suspect that type of nesting will confuse most markup-language parsers, if the output is supposed to be a (flat) list of languages and their associated IDs – Gus Dec 01 '15 at 19:53
  • @Shnugo It's a mistake to treat HTML as XML. Even the W3C backed away from XHTML. HTML just isn't XML. – spender Dec 01 '15 at 19:53
  • @spender, it seems to be clear to you, that this is (X)HTML. Where do you take this from? Could you give me a link to read about this? – Shnugo Dec 01 '15 at 19:55
  • @spender, whoops reading the title would have helped :-) – Shnugo Dec 01 '15 at 19:57
  • Yes guys I am aware that while they work for the website I scraped it from (using HTMLAgilityPack), they would have worked better if they weren't nested. In that case I wouldn't have needed Regex at all, and would simply just get the attribute and inner text values. But this arrangement got me a little stumped. – Khawar Nadeem Dec 02 '15 at 21:31

1 Answer

This is a simple regex that fits your example string. It doesn't try to account for everything HTML allows, such as additional attributes on the option tag or non-numeric values, but it's quick and it works:

using System.Linq;
using System.Text.RegularExpressions;

// Capture the numeric value attribute and the text that follows it, up to the next tag.
var regex = @"<option value='(?<value>\d+)'>(?<content>[^<]*)";
var search = @"Language<option value='32'>Bahasa Indonesia<option value='19'>Dansk<option value='4'>Deutsch<option value='1'>English<option value='2'></option></option></option></option>";
var list = Regex
  .Matches(search, regex)
  .Cast<Match>()
  .Select(match => new {
    Value = match.Groups["value"].Value,
    Content = match.Groups["content"].Value });

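Output: five items pairing each value with its text: 32 / Bahasa Indonesia, 19 / Dansk, 4 / Deutsch, 1 / English, and 2 with empty content.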

Your HTML options shouldn't be nested like that either.
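
If you want the results in the CVModel class from the question rather than an anonymous type, the same idea could be wrapped up roughly like this. This is only a sketch: the OptionParser class and its Parse method are names made up for illustration, and the pattern additionally assumes the value may be wrapped in either single or double quotes. It still expects "value" to be the first attribute and to be numeric.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class CVModel
{
    public string value;
    public string content;
}

// Hypothetical helper, not part of any library.
public static class OptionParser
{
    // Assumption: value is numeric and quoted with ' or ".
    // The content group takes everything up to the next '<'.
    private static readonly Regex OptionPattern = new Regex(
        @"<option\s+value=['""](?<value>\d+)['""]\s*>(?<content>[^<]*)",
        RegexOptions.IgnoreCase);

    public static List<CVModel> Parse(string html)
    {
        return OptionPattern.Matches(html)
            .Cast<Match>()
            .Select(m => new CVModel
            {
                value = m.Groups["value"].Value,
                content = m.Groups["content"].Value.Trim()
            })
            .ToList();
    }
}
```

Calling OptionParser.Parse on the string from the question should give a list of five CVModel items, the last of which has an empty content because the final option tag has no text. If that entry isn't wanted, a Where(item => item.content.Length > 0) before ToList() would drop it.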

Robert McKee
  • Yes, I know they shouldn't be nested like that (see my comment above). Thank you for a solution, I will try this in the morning and then accept your answer :) Appreciate it. – Khawar Nadeem Dec 02 '15 at 21:32