Parsing multiple groups

Question

I have an HTML file (I can't use HTML Agility Pack) from which I want to extract the id of a div (if it has one).

<div id="div1">Street ___________________ </div>
<div id="div2">CAP |__|__|__|__|__| number ______ </div>
<div id="div3">City _____________________ State |__|__|</div>
<div id="div4">City2 ____________________ State2 _____</div>

I have a pattern for extracting the underscores: [\ _]{3,}

If there is a div in front of the underscores, I want to extract it as well; if not, I'll get only the underscores.

So far I have built this pattern: (<div id(.+?)>(\w)([\ _]{3,}/*))([\ _]{3,})

The first part is built out of 3 groups: 1 - a div tag, 2 - a label, 3 - underscores

1 - <div id(.+?)>, 2 - (\w) , 3 - [\ _]{3,}/*

The div with the id div2 should not have its id extracted, because its content contains non-alphanumeric chars.

Q: What is wrong with my pattern?

Desired matches for the 4 divs:

<div id="div1">Street ___________________
______ 
<div id="div3">City _____________________
<div id="div4">City2 ____________________
_____

Aaaah... html parsing with a regex!!! http://stackoverflow.com/q/1732348/613130 – xanatos, Aug 07 '13 at 09:35
@xanatos: It's not really HTML parsing, because the requirements don't involve nested items, which is the main problem with parsing via regex. – Chris, Aug 07 '13 at 09:40
So you can have div WITHOUT id and div WITH id, and you want to extract the id (if present) and the content of these div(s), right? – xanatos, Aug 07 '13 at 09:44

Bernhard Barker · Accepted Answer · 2013-08-08T09:24:15.470

1

`\w` matches just a single character; you probably want one or more: `\w+`.
`/*` means zero or more `/`'s? I don't see where that fits in.
One or more non-`>` characters (i.e. `[^>]+`) is probably a better idea than `.+?`. `.+?` will try to stop at the first `>`, but will keep going past it until the rest of the pattern matches, i.e.:
```
<div id=1>this is not valid</div><div id=2>this is valid___</div>
```
will match the whole string, instead of just from <div id=2>.
As far as I can tell from your question, everything before the underscores should be optional.
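
To illustrate the third point, here is a minimal sketch comparing `.+?` with `[^>]+`. The class and method names, the sample input, and the simplified patterns are my own assumptions for illustration; they are not the asker's full pattern.

```csharp
using System.Text.RegularExpressions;

public static class LazyVsNegated
{
    // Hypothetical sample: the first div has no underscores, the second does.
    public const string Input =
        "<div id=1>no blanks here</div><div id=2>Name___</div>";

    // Lazy dot: allowed to run past the first '>' if the rest fails there.
    public static string WithLazyDot() =>
        Regex.Match(Input, @"<div id.+?>\w+_{3,}").Value;

    // Negated class: can never cross a '>', so the first div is rejected.
    public static string WithNegatedClass() =>
        Regex.Match(Input, @"<div id[^>]+>\w+_{3,}").Value;
}
```

On this input, `WithLazyDot()` should return everything from `<div id=1>` through `Name___`, while `WithNegatedClass()` should return only `<div id=2>Name___`.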

Pattern:

(?:(<div id[^>]+>)(\w+))?([\ _]{3,})

C# Test.
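
A minimal sketch of running the pattern above against the four divs from the question. `BlankExtractor` and `Extract` are hypothetical names, and the values are trimmed only to make them easier to read.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BlankExtractor
{
    // Pattern from the answer: optional "<div id...>" plus label,
    // followed by a run of at least three spaces/underscores.
    private static readonly Regex Pattern =
        new Regex(@"(?:(<div id[^>]+>)(\w+))?([\ _]{3,})");

    // The four sample lines from the question, joined without indentation
    // so that no extra runs of spaces are introduced.
    public static readonly string Sample = string.Join("\n", new[]
    {
        @"<div id=""div1"">Street ___________________ </div>",
        @"<div id=""div2"">CAP |__|__|__|__|__| number ______ </div>",
        @"<div id=""div3"">City _____________________ State |__|__|</div>",
        @"<div id=""div4"">City2 ____________________ State2 _____</div>"
    });

    // Returns every match with surrounding spaces trimmed.
    public static List<string> Extract(string html)
    {
        var results = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            results.Add(m.Value.Trim());
        }
        return results;
    }
}
```

`BlankExtractor.Extract(BlankExtractor.Sample)` should give five entries: the div1, div3 and div4 matches start with their `<div id=...>` tag and label, while the runs after `number` and `State2` come back as bare underscores because the optional group does not match there.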

edited Aug 08 '13 at 09:24

answered Aug 07 '13 at 09:48


I think you are missing a parenthesis in your pattern. If I add a ) : (?:( ]+?)>(\w+))?([\ _|]{3,})(?:(\w+)([\ _|]{3,}))?) then in the second match I capture this string |__|__|__|__|__|, which is not alphanumeric. – Misi Aug 08 '13 at 07:18
Edited my answer, it should be closer to what you want now. – Bernhard Barker Aug 08 '13 at 08:44
In this line "City _____________________ State |__|__|" it only captures "_____________________". "City" contains alphanumeric chars so it should have been matched. – Misi Aug 08 '13 at 09:08
Seems I accidentally left the escape characters there. Fixed. – Bernhard Barker Aug 08 '13 at 09:24
This works better. If I want to allow spaces why doesn't it work if I add "\s" ? (\w\s+) – Misi Aug 08 '13 at 10:00

`\w\s+` means one word character and one or more white-space characters (not one or more of each). To say one or more of either, you can say `[\w\s]+` or `(\w|\s)+`. – Bernhard Barker Aug 08 '13 at 10:05
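
A small sketch of the difference described in the comment above. The `LabelPatterns` class, its method names, and the `^`/`$` anchors are assumptions added here so that the whole string must match.

```csharp
using System.Text.RegularExpressions;

public static class LabelPatterns
{
    // \w\s+ : exactly one word character, then one or more whitespace characters.
    public static bool OneWordCharThenSpaces(string s) =>
        Regex.IsMatch(s, @"^\w\s+$");

    // [\w\s]+ : one or more characters, each either a word character or whitespace.
    public static bool AnyMixOfWordAndSpace(string s) =>
        Regex.IsMatch(s, @"^[\w\s]+$");
}
```

With these, `OneWordCharThenSpaces("a  ")` is true but `OneWordCharThenSpaces("Street name")` is false, whereas `AnyMixOfWordAndSpace("Street name")` is true.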

score 1 · Answer 2 · answered Aug 07 '13 at 09:51


Try something like

using System;
using System.Text.RegularExpressions;

string html = @"<div id=""div1"">Street ___________________ </div>
<div id=""div2"">CAP |__|__|__|__|__| number ______ </div>
<div id=""div3"">City _____________________ State |__|__|</div>
<div name=""hello"" id=""div4"">City _____________________ State |__|__|</div>
<div name=""house"">City _____________________ State |__|__|</div>
<div id=""notext""></div>";

// Capture an optional id attribute and the text content of each div.
var rx = new Regex(@"<div(?:(?: id=""(?<id>[^""]+)"")|[^>])*>(?<content>[^<]*)</div>",
                   RegexOptions.IgnoreCase);

var matches = rx.Matches(html);

foreach (Match match in matches)
{
    // id.Success is false when the div has no id attribute.
    var id = match.Groups["id"];
    var content = match.Groups["content"];

    Console.WriteLine("id present: {0}, id: {1}, text: {2}",
                      id.Success,
                      id.Value,
                      content.Value);
}

If it works, I'll explain the regex (that is, <div(?:(?: id="(?<id>[^"]+)")|[^>])*>(?<content>[^<]*)</div>)
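
A rough reading of that pattern, written with `RegexOptions.IgnorePatternWhitespace` so that each piece can carry a comment. This is one interpretation, not the answerer's own explanation. The `DivRegex` class name is hypothetical, and the only change to the pattern is escaping the literal space before `id`, which this mode requires.

```csharp
using System.Text.RegularExpressions;

public static class DivRegex
{
    public static readonly Regex Pattern = new Regex(@"
        <div                             # start of the opening tag
        (?:                              # inside the tag, repeat:
            (?:\ id=""(?<id>[^""]+)"")   #   an id attribute, value captured as 'id'
          | [^>]                         #   or any one character that is not '>'
        )*
        >                                # end of the opening tag
        (?<content>[^<]*)                # text up to the next '<'
        </div>                           # closing tag
        ",
        RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
}
```

Because the id alternative is tried before the catch-all `[^>]` at every position, the id is picked up wherever it sits among the attributes. A div without an id still matches; in that case `Groups["id"].Success` is false, which is what the loop above prints as "id present".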

answered Aug 07 '13 at 09:51

xanatos


CAP **|__|__|__|__|__|** number ______ - this is the second match and it contains the non-alphanumeric chars | and _ – Misi Aug 08 '13 at 07:13
@Misi AND??? It isn't clear what you want. My regex will simply extract all the <div> present in an html... – xanatos Aug 08 '13 at 07:29
I don't want to extract the whole div if the content doesn't have alphanumeric characters. The Content for the 2nd match is : "CAP |__|__|__|__|__| number " – Misi Aug 08 '13 at 07:53
@Misi So you don't want the CAP line because it has the `|` ? Or you don't want only the CAP part and you want number ____ ? – xanatos Aug 08 '13 at 08:00
And what should happen with the City line? (that has the State ___ part)? – xanatos Aug 08 '13 at 08:04
I've edited my answer. "City" has all chars alphanumeric, "CAP |__|__|__|__|__| number" contains non-alphanumeric chars. – Misi Aug 08 '13 at 08:36
