2

Let's presume we have something like this:

<div1>
    <h1>text1</h1>
    <h1>text2</h1>
</div1>
<div2>
    <h1>text3</h1>
</div2>

Using RegExp we need to get text1 and text2 but not text3.

How can I do this?

Thanks in advance.

EDIT: This is just an example. The text I'm parsing could be just plain text. The main thing I want to accomplish is to list all strings from a specific section of a document. I gave this HTML code as an example because it perfectly resembles the thing I need to get.

(?siU)<h1>(.*)</h1> would match all three strings, but how do I get only the first two?

EDIT2: Here is another rather dumb example. :)

Section1

This is a "very" nice sentence.
It has "just" a few words.

Section2

This is "only" an example.

The End

I need the quoted words from the first section but not from the second.

Yet again, (?siU)"(.*)" returns quoted words from the whole text, and I need only those between the words Section1 and Section2.

This is for the "Rainmeter" application, which apparently uses Perl regex syntax.

I'm sorry, but I can't explain it better. :)

Brock Adams
mmatz
  • Number of `<h1>` occurrences can be any. – mmatz Aug 18 '10 at 22:46
  • 7
    What criterion determines which content you want? What language are you programming in? Also, you really shouldn't use regexes to parse HTML. – Marcelo Cantos Aug 18 '10 at 22:47
  • 1
    Refer to this post on parsing html with Regex: [link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – Kevin Vermeer Aug 18 '10 at 22:56
  • @Marcelo Cantos: Criterion can vary, but for the first example, I need the content inside `<h1>` tags from the `<div1>` section. I'm not programming in any language, I'm only modifying my desktop with Rainmeter, which uses RegExp for some parts. :) Nothing really important here. – mmatz Aug 18 '10 at 23:52
  • you are making your question so vague, it is impossible to answer it. – Ether Aug 19 '10 at 02:51

2 Answers

2

Use a DOM library: call getElementsByTagName('div') and you'll get a nodeList back. Reference the first item with ->item(0), then call getElementsByTagName('h1') using that div as the context node and grab the text with the ->nodeValue property.
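As a rough illustration of the DOM approach described above, here is a minimal Python sketch using the standard library's `xml.dom.minidom`. It assumes the sections are ordinary `div` elements (not `div1`/`div2`) and wraps them in a hypothetical `root` element, because an XML parser needs a single top-level node. The names `SAMPLE` and `headings_in_first_div` are made up for this example.

```python
from xml.dom.minidom import parseString

# Assumed input: plain <div> sections inside a hypothetical <root> wrapper.
SAMPLE = """<root>
<div>
    <h1>text1</h1>
    <h1>text2</h1>
</div>
<div>
    <h1>text3</h1>
</div>
</root>"""


def headings_in_first_div(markup):
    """Return the text of every <h1> found inside the first <div>."""
    doc = parseString(markup)
    # item(0) picks the first <div> from the returned NodeList.
    first_div = doc.getElementsByTagName("div").item(0)
    # Search for <h1> using that div as the context node.
    # In minidom the text lives in the element's first child node.
    return [h1.firstChild.nodeValue
            for h1 in first_div.getElementsByTagName("h1")]


if __name__ == "__main__":
    print(headings_in_first_div(SAMPLE))
```

Because the search for `h1` starts from the first `div`, it does not matter how many headings that section contains.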

meder omuraliev
  • 1
    Ah, but he did not use `div` tags. He used `div1` and `div2` (¿etc?). :) – Brock Adams Aug 18 '10 at 23:13
  • I take it he meant to do `div` but provided the numbers to indicate first, second. And he could also just do `getElementsByTagName` on h1 and grab the first 2 nodeValues in the nodeList. – meder omuraliev Aug 18 '10 at 23:48
  • Since the number of `h1`'s varies, and I need all of them, grabbing only the first two isn't the solution. As for the `div1` and `div2` confusion, look at the second example to see what I need. :) – mmatz Aug 18 '10 at 23:56
2

For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:

(?siU)<h1>(.*)</h1>(?=.+<div2>) for the first sample and

(?siU)"(.*)"(?=.+Section2) for the second.

Note that Rainmeter seems to escape things for you, but you might need to change " to \", above.

Both use positive lookahead, but beware: both solutions will fail in the case of nested tags/structures or if there are multiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.

But maybe this is good enough for your current needs?
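If you want to try the two lookahead patterns outside Rainmeter, here is a small Python sketch. One assumption to note: Python's `re` module has no `U` (ungreedy) flag, so the lazy quantifiers `.*?` and `.+?` stand in for it, and the `s` and `i` options are passed as `re.S | re.I`. The constant and function names are hypothetical, chosen only for this example.

```python
import re

HTML_SAMPLE = """<div1>
    <h1>text1</h1>
    <h1>text2</h1>
</div1>
<div2>
    <h1>text3</h1>
</div2>"""

TEXT_SAMPLE = '''Section1

This is a "very" nice sentence.
It has "just" a few words.

Section2

This is "only" an example.

The End'''

# Lazy quantifiers replace the (?U) flag, which Python does not support.
# The lookahead only accepts a match if the end marker still lies ahead.
H1_BEFORE_DIV2 = re.compile(r"<h1>(.*?)</h1>(?=.+?<div2>)", re.S | re.I)
QUOTED_BEFORE_SECTION2 = re.compile(r'"(.*?)"(?=.+?Section2)', re.S | re.I)


def captures(pattern, text):
    """Return the captured group of every match of pattern in text."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(captures(H1_BEFORE_DIV2, HTML_SAMPLE))
    print(captures(QUOTED_BEFORE_SECTION2, TEXT_SAMPLE))
```

On the two samples from the question this should capture only the items that appear before `<div2>` and before `Section2`. The limitation stays the same: the lookahead checks only that the end marker appears somewhere later, so repeated sections will confuse it.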

Brock Adams
  • The thing is, there are nested tags and your solution doesn't work. But by modifying it I managed to solve my problem. `(?siU)

    (.*)

    .*(?=.+)` will work even if there are nested tags/structures. Thank you very much. I wouldn't be able to do it without your help. :D
    – mmatz Aug 19 '10 at 16:09