haystack:

<h2 >a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
bbbb
</div>

pattern I used:

#<h2[^>]*>(a|b)(?!</h2>)[\s\S]*</h2><div class="indent">((?!</div>)[\s\S]+)</div>#

This pattern only matches the content of the first h2 (i.e. a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;) and the content of the last div (i.e. bbbb).

But I want it to match the content of every h2 and its div, so that I get a one-to-one map (e.g. a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;=>aaaa, b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;=>bbbb). How do I do this?

dotslashlu

1 Answer

[\s\S]* and [\s\S]+ are greedy, meaning they will match as many characters as possible. Try changing them to [\s\S]*? and [\s\S]+?.

With your current regex, if you were to put your [\s\S]* into a capturing group you would see that it matches the following:

&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;

Adding the ? at the end makes the quantifier lazy: instead of matching as many characters as possible, it matches as few as possible, so it stops at the first </h2> like you want. The same reasoning applies to the [\s\S]+ later in your regex.
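To see the difference concretely, here is a minimal sketch. It assumes Python's re module (the question does not say which language is in use, and the # delimiters are dropped because Python does not need them). The second capturing group is added only to expose what [\s\S]* consumes.

```python
import re

# Sample text copied from the question.
haystack = """<h2 >a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
bbbb
</div>
"""

# Greedy: [\s\S]* runs to the end of the input, then backtracks to the LAST </h2>.
greedy = re.search(r"<h2[^>]*>(a|b)([\s\S]*)</h2>", haystack)

# Lazy: [\s\S]*? grows one character at a time and stops at the FIRST </h2>.
lazy = re.search(r"<h2[^>]*>(a|b)([\s\S]*?)</h2>", haystack)

print(repr(greedy.group(2)))  # spans the first div and the second heading
print(repr(lazy.group(2)))    # only the rest of the first heading
```

The greedy group swallows the first div and the start of the second h2, which is why the original pattern produced a single match.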

It also looks like this should fail on your sample string: your regex has </h2><div... in the middle, but in your sample text there is always a newline between the closing </h2> and the <div>. You should probably change this section to </h2>\s*<div.... End result:

#<h2[^>]*>(a|b)(?!</h2>)[\s\S]*?</h2>\s*<div class="indent">((?!</div>)[\s\S]+?)</div>#
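As a quick check, the corrected pattern can be run against the sample text to build the one-to-one map. This is a sketch assuming Python's re module, again without the # delimiters. The names pattern, matches and mapping are illustrative only.

```python
import re

# Sample text copied from the question.
haystack = """<h2 >a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
bbbb
</div>
"""

# The corrected pattern from the answer, unchanged apart from the delimiters.
pattern = re.compile(
    r'<h2[^>]*>(a|b)(?!</h2>)[\s\S]*?</h2>\s*'
    r'<div class="indent">((?!</div>)[\s\S]+?)</div>'
)

# findall returns one (group 1, group 2) tuple per heading/div pair.
matches = pattern.findall(haystack)

# Strip the newlines that surround the div content.
mapping = {heading: body.strip() for heading, body in matches}
print(mapping)  # {'a': 'aaaa', 'b': 'bbbb'}
```

Group 1 only captures the leading a or b. To key the map on the full heading text, the first group would have to wrap the lazy [\s\S]*? as well.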

But don't parse HTML with regex!

Andrew Clark
  • It works, thank you! And thanks for the reminder, but does crawling count as 'parsing'? If I don't use regex to build a crawler, what should I use? – dotslashlu Jun 14 '12 at 22:47
  • @Wasabi You should use an HTML parser that someone else has already written and done correctly. – Andrew Clark Jun 14 '12 at 22:50
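
As one illustration of the parser route, here is a rough sketch that uses html.parser from Python's standard library. The choice of Python and the class name HeadingMapParser are assumptions, not something named in the thread. It pairs each h2 with the div class="indent" that follows it, and it does not handle nested divs.

```python
from html.parser import HTMLParser

# Sample text copied from the question.
haystack = """<h2 >a&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
aaaa
</div>
<h2 >b&nbsp; &middot;&nbsp;&middot;&nbsp;&middot;
</h2>
<div class="indent">
bbbb
</div>
"""


class HeadingMapParser(HTMLParser):
    """Hypothetical helper: maps each <h2> text to the next indent <div> text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &nbsp;, &middot;, ...
        self.pairs = {}
        self._inside = None    # "h2" or "div" while collecting text
        self._buffer = []
        self._heading = None   # most recent heading still waiting for its div

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._inside, self._buffer = "h2", []
        elif tag == "div" and ("class", "indent") in attrs and self._heading is not None:
            self._inside, self._buffer = "div", []

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._inside == "h2":
            self._heading = "".join(self._buffer).strip()
            self._inside = None
        elif tag == "div" and self._inside == "div":
            self.pairs[self._heading] = "".join(self._buffer).strip()
            self._heading = None
            self._inside = None


parser = HeadingMapParser()
parser.feed(haystack)
parser.close()
print(parser.pairs)
```

Unlike the regex, this keeps working if attributes are reordered or whitespace changes, and the entities in the headings come back already decoded.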