I have an HTML file (I can't use HTML Agility Pack) and I want to extract the id of each div (if it has one).

<div id="div1">Street ___________________ </div>
<div id="div2">CAP |__|__|__|__|__| number ______ </div>
<div id="div3">City _____________________ State |__|__|</div>
<div id="div4">City2 ____________________ State2 _____</div>

I have a pattern for extracting runs of underscores: [\ _]{3,}

If there is a div in front of the underscores, I want to extract it as well; if not, I want only the underscores.

So far I have built this pattern: (<div id(.+?)>(\w)([\ _]{3,}/*))([\ _]{3,})

The first part is built out of 3 groups: 1 - a div tag, 2 - a label, 3 - underscores

1 - <div id(.+?)>, 2 - (\w) , 3 - [\ _]{3,}/*

The div with the id div2 should not yield its id, because its content contains non-alphanumeric chars.

Q: What is wrong with my pattern?

Desired matches for the 4 divs:

<div id="div1">Street ___________________
______ 
<div id="div3">City _____________________
<div id="div4">City2 ____________________
_____
Misi
  • Aaaah... html parsing with a regex!!! http://stackoverflow.com/q/1732348/613130 – xanatos Aug 07 '13 at 09:35
  • @xanatos: It's not really HTML parsing, because the requirements don't involve nested items, which is the main problem with parsing via regex. – Chris Aug 07 '13 at 09:40
  • So you can have div WITHOUT id and div WITH id, and you want to extract the id (if present) and the content of these div(s), right? – xanatos Aug 07 '13 at 09:44

2 Answers

  • \w matches just a single character; you probably want one or more: \w+.

  • /* - zero or more /'s? I don't see where that fits in.

  • One or more non-> characters (i.e. [^>]+) is probably a better idea than .+?. .+? prefers to stop at the first >, but if the rest of the pattern fails there it keeps expanding past that > until it finds a string that matches, i.e.:

    <div id=1>this is not valid</div><div id=2>this is valid___</div>
    

    will match the whole string, instead of just from <div id=2>.

  • As far as I can tell from your question, everything before the underscores should be optional.
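To make the .+? versus [^>]+ point concrete, here is a minimal sketch. The input string and the variable names are made up for illustration, and the two patterns are simplified versions of the one in the question, so treat it as a demonstration of the backtracking behaviour rather than a fix.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: the first div has no run of underscores, the second does.
string input = "<div id=1>no blanks</div><div id=2>Street___</div>";

// Lazy dot: may run past the first '>' when the rest of the pattern fails there.
var lazy = new Regex(@"<div id(.+?)>(\w+)([\ _]{3,})");
// Negated class: can never cross a '>', so the match stays inside one opening tag.
var strict = new Regex(@"<div id([^>]+)>(\w+)([\ _]{3,})");

string lazyMatch = lazy.Match(input).Value;
string strictMatch = strict.Match(input).Value;

Console.WriteLine(lazyMatch);   // <div id=1>no blanks</div><div id=2>Street___
Console.WriteLine(strictMatch); // <div id=2>Street___
```

The lazy version starts at the first div and swallows everything up to the second one, which is usually not what is wanted.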

Pattern:

(?:(<div id[^>]+>)(\w+))?([\ _]{3,})

C# Test.
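As a rough check, the pattern above can be run against the four divs from the question. This is only a sketch: the variable names are illustrative, and the lines are joined with "\n" on the assumption that line breaks never occur inside a div.

```csharp
using System;
using System.Text.RegularExpressions;

// The four sample lines from the question.
string html = string.Join("\n", new[]
{
    @"<div id=""div1"">Street ___________________ </div>",
    @"<div id=""div2"">CAP |__|__|__|__|__| number ______ </div>",
    @"<div id=""div3"">City _____________________ State |__|__|</div>",
    @"<div id=""div4"">City2 ____________________ State2 _____</div>"
});

var rx = new Regex(@"(?:(<div id[^>]+>)(\w+))?([\ _]{3,})");
MatchCollection matches = rx.Matches(html);

foreach (Match m in matches)
{
    // Groups 1 and 2 are unset when the optional prefix did not take part in the match.
    string tag = m.Groups[1].Success ? m.Groups[1].Value : "(no div)";
    string label = m.Groups[2].Success ? m.Groups[2].Value : "(no label)";
    Console.WriteLine("{0} | {1} | [{2}]", tag, label, m.Groups[3].Value.Trim());
}
```

This should produce five matches: div1, div3 and div4 with their labels, plus the bare underscore runs after "number" and "State2", which lines up with the desired matches listed in the question.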

Bernhard Barker
  • I think you are missing a parenthesis in your pattern. If I add a ): (?:(<div id[^>]+?)>(\w+))?([\ _|]{3,})(?:(\w+)([\ _|]{3,}))?) then in the second match I capture the string |__|__|__|__|__|, which is not alphanumeric.
    – Misi Aug 08 '13 at 07:18
  • Edited my answer, it should be closer to what you want now. – Bernhard Barker Aug 08 '13 at 08:44
  • In this line "City _____________________ State |__|__|" it only captures "_____________________". "City" contains alphanumeric chars so it should have been matched.
    – Misi Aug 08 '13 at 09:08
  • Seems I accidentally left the escape characters there. Fixed. – Bernhard Barker Aug 08 '13 at 09:24
  • This works better. If I want to allow spaces why doesn't it work if I add "\s" ? (\w\s+) – Misi Aug 08 '13 at 10:00
  • `\w\s+` means one word character and one or more white-space characters (not one or more of each). To say one or more of either, you can say `[\w\s]+` or `(\w|\s)+`. – Bernhard Barker Aug 08 '13 at 10:05
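A small sketch of that difference, using a made-up label ("New York") and anchored patterns so that the whole string has to match:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical label containing a space.
string label = "New York";

// One word character followed by whitespace: fails, 'N' is followed by 'e'.
bool oneThenSpaces = Regex.IsMatch(label, @"^\w\s+$");
// One or more characters, each either a word character or whitespace.
bool eitherClass = Regex.IsMatch(label, @"^[\w\s]+$");
bool eitherAlternation = Regex.IsMatch(label, @"^(\w|\s)+$");
// The first pattern only fits a single character followed by whitespace.
bool singleCharLabel = Regex.IsMatch("a   ", @"^\w\s+$");

Console.WriteLine("{0} {1} {2} {3}", oneThenSpaces, eitherClass, eitherAlternation, singleCharLabel);
// False True True True
```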

Try something like

using System;
using System.Text.RegularExpressions;

string html = @"<div id=""div1"">Street ___________________ </div>
<div id=""div2"">CAP |__|__|__|__|__| number ______ </div>
<div id=""div3"">City _____________________ State |__|__|</div>
<div name=""hello"" id=""div4"">City _____________________ State |__|__|</div>
<div name=""house"">City _____________________ State |__|__|</div>
<div id=""notext""></div>";

// Matches every <div>...</div>; the id attribute is optional and may appear anywhere in the tag.
var rx = new Regex(@"<div(?:(?: id=""(?<id>[^""]+)"")|[^>])*>(?<content>[^<]*)</div>", 
                   RegexOptions.IgnoreCase);

var matches = rx.Matches(html);

foreach (Match match in matches)
{
    // id.Success is false when the div has no id attribute.
    var id = match.Groups["id"];
    var content = match.Groups["content"];

    Console.WriteLine("id present: {0}, id: {1}, text: {2}", 
                      id.Success, 
                      id.Value, 
                      content.Value);
}

If it works, I'll explain the regex (that is, <div(?:(?: id="(?<id>[^"]+)")|[^>])*>(?<content>[^<]*)</div>).
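One way to read that regex is to spell it out with RegexOptions.IgnorePatternWhitespace and inline comments. The sketch below is an interpretation, not the answerer's own explanation: the leading space before id is written as [ ] because plain whitespace is ignored in this mode, and the shortened sample input is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

var rx = new Regex(@"
    <div                             # literal start of the opening tag
    (?:                              # repeated zero or more times:
        (?:[ ]id=""(?<id>[^""]+)"")  #   an id attribute, value captured as id
      |                              #   or
        [^>]                         #   any single character except the closing bracket
    )*
    >                                # end of the opening tag
    (?<content>[^<]*)                # text up to the next tag, captured as content
    </div>                           # closing tag",
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

// Hypothetical, shortened sample input.
string html = string.Join("\n", new[]
{
    @"<div id=""div1"">Street ___ </div>",
    @"<div name=""hello"" id=""div4"">City ___</div>",
    @"<div name=""house"">State |__|__|</div>",
    @"<div id=""notext""></div>"
});

MatchCollection matches = rx.Matches(html);

foreach (Match m in matches)
{
    Console.WriteLine("id present: {0}, id: '{1}', text: '{2}'",
                      m.Groups["id"].Success, m.Groups["id"].Value, m.Groups["content"].Value);
}
```

Because the id alternative is tried before the catch-all [^>] at every position inside the tag, the id is picked up wherever it appears among the attributes, and divs without an id still match with the id group unset.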

xanatos
  • CAP **|__|__|__|__|__|** number ______ - this is the second match and it contains the non-alphanumeric chars | and _ – Misi Aug 08 '13 at 07:13
  • @Misi AND??? It isn't clear what you want. My regex will simply extract all the <div> present in an html...
    – xanatos Aug 08 '13 at 07:29
  • I don't want to extract the whole div if the content doesn't have alphanumeric characters. The Content for the 2nd match is : "CAP |__|__|__|__|__| number " – Misi Aug 08 '13 at 07:53
  • @Misi So you don't want the CAP line because it has the `|` ? Or you don't want only the CAP part and you want number ____ ? – xanatos Aug 08 '13 at 08:00
  • And what should happen with the City line? (that has the State ___ part)? – xanatos Aug 08 '13 at 08:04
  • I've edited my answer. "City" has all chars alphanumeric, "CAP |__|__|__|__|__| number" contains non-alphanumeric chars. – Misi Aug 08 '13 at 08:36