RegExp get string inside string

Question

Let's presume we have something like this:

<div1>
    <h1>text1</h1>
    <h1>text2</h1>
</div1>
<div2>
    <h1>text3</h1>
</div2>

Using RegExp we need to get text1 and text2 but not text3.

How to do this?

Thanks in advance.

EDIT: This is just an example. The text I'm parsing could be just plain text. The main thing I want to accomplish is list all strings from a specific section of a document. I gave this HTML code for example as it perfectly resembles the thing I need to get.

(?siU)<h1>(.*)</h1> would parse all three strings, but how to get only first two?

EDIT2: Here is another rather dumb example. :)

Section1

This is a "very" nice sentence.
It has "just" a few words.

Section2

This is "only" an example.

The End

I need the quoted words from the first section but not from the second.

Yet again, (?siU)"(.*)" returns quoted words from the whole text, and I need only those between the words Section1 and Section2.

This is for the "Rainmeter" application, which apparently uses Perl regex syntax.

I'm sorry, but I can't explain it better. :)

What criterion determines which content you want? What language are you programming in? Also, you really shouldn't use regexes to parse HTML. — Marcelo Cantos, Aug 18 '10 at 22:47
Refer to this post on parsing html with Regex: [link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) — Kevin Vermeer, Aug 18 '10 at 22:56
@Marcelo Cantos: Criterion can vary, but for the first example, I need the content inside `<h1>` tags from the `<div1>` section. I'm not programming in any language, I'm only modifying my desktop with Rainmeter, which uses RegExp for some parts. :) Nothing really important here. — mmatz, Aug 18 '10 at 23:52
you are making your question so vague, it is impossible to answer it. — Ether, Aug 19 '10 at 02:51

score 2 · Answer 1 · answered Aug 18 '10 at 22:49

Use a DOM library: call getElementsByTagName('div') and you'll get a nodeList back. Reference the first item with ->item(0), then call getElementsByTagName('h1') using that div as the context node, and grab the text of each result with the ->nodeValue property.
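As a rough illustration of those steps, here is a minimal sketch using Python's standard `xml.dom.minidom` as a stand-in for whichever DOM library is available. It assumes the sections are plain `<div>` tags rather than `div1`/`div2`, and it wraps the sample in a hypothetical `<root>` element so that it parses as well-formed XML. The text is read from each heading's first child node.

```python
from xml.dom.minidom import parseString

# Assumption: the question's markup rewritten with plain <div> tags and a
# hypothetical <root> wrapper so that it is well-formed XML.
SAMPLE = """<root>
<div>
    <h1>text1</h1>
    <h1>text2</h1>
</div>
<div>
    <h1>text3</h1>
</div>
</root>"""


def headings_in_first_div(markup):
    """Return the text of every <h1> inside the first <div>."""
    doc = parseString(markup)
    # First <div> in document order, used as the context node.
    first_div = doc.getElementsByTagName("div").item(0)
    # The text lives in the text node that is the first child of each <h1>.
    return [h1.firstChild.nodeValue
            for h1 in first_div.getElementsByTagName("h1")]


print(headings_in_first_div(SAMPLE))
```

Because the second lookup is scoped to the first div, it picks up however many `h1` elements that section contains and nothing from later sections.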

meder omuraliev

Ah, but he did not use `div` tags. He used `div1` and `div2` (¿etc?). :) – Brock Adams Aug 18 '10 at 23:13
I take it he meant to do `div` but provided the numbers to indicate first, second. And he could also just do `getElementsByTagName` on h1 and grab the first 2 nodeValues in the nodeList. – meder omuraliev Aug 18 '10 at 23:48
Since number of `h1`'s varies, and I need all of them, grabbing only first two isn't the solution. As for `div1` and `div2` confusion, look at the second example to see what I need. :) – mmatz Aug 18 '10 at 23:56

score 2 · Accepted Answer · answered Aug 19 '10 at 08:55

For the general case of the two examples provided -- for use in Rainmeter regex -- you can use:

(?siU)<h1>(.*)</h1>(?=.+<div2>) for the first sample and

(?siU)"(.*)"(?=.+Section2) for the second.

Note that Rainmeter seems to escape things for you, but you might need to change " to \", above.

These both use Positive Lookahead but beware: both solutions will fail in the case of nested tags/structures or if there are multiple Section1's and Section2's. Regex is not the best tool for this kind of parsing.

But maybe this is good enough for your current needs?
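To try the lookahead idea outside Rainmeter, here is a small sketch in Python. It is a translation, not the Rainmeter syntax: Python's `re` module has no `U` (ungreedy) flag, so the laziness is written out with `*?` and `+?`. The sample strings are copied from the question, and the constant names are made up for illustration.

```python
import re

HTML_SAMPLE = """<div1>
    <h1>text1</h1>
    <h1>text2</h1>
</div1>
<div2>
    <h1>text3</h1>
</div2>"""

TEXT_SAMPLE = '''Section1

This is a "very" nice sentence.
It has "just" a few words.

Section2

This is "only" an example.

The End'''

# (?s) lets . match newlines, (?i) ignores case.
# The lookahead only accepts a match if the end marker still lies ahead.
H1_BEFORE_DIV2 = re.compile(r"(?si)<h1>(.*?)</h1>(?=.+?<div2>)")
QUOTED_BEFORE_SECTION2 = re.compile(r'(?si)"(.*?)"(?=.+?Section2)')

print(H1_BEFORE_DIV2.findall(HTML_SAMPLE))
print(QUOTED_BEFORE_SECTION2.findall(TEXT_SAMPLE))
```

With these samples the first call yields `text1` and `text2`, and the second yields `very` and `just`. Anything after the end marker is rejected because the lookahead can no longer find `<div2>` or `Section2` ahead of it.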

The thing is, there are nested tags and your solution doesn't work. But by modifying it I managed to solve my problem. `(?siU)
(.*)
.*(?=.+)` will work even if there are nested tags/structures. Thank you very much. I wouldn't be able to do it without your help. :D — mmatz, Aug 19 '10 at 16:09
