3

I'm trying to match the text content of the first <test> tag.

For example:

<test>SAMPLE TEXT</test><test>SAMPLE TEXT2</test><test>SAMPLE TEXT3</test>

If I use

("<test>(.*)</test>")

I get this:

SAMPLE TEXT</test><test>SAMPLE TEXT2</test><test>SAMPLE TEXT3

How can I get just the content of the first <test> tag: SAMPLE TEXT?

abatishchev
Mega
  • That looks like XML. Luckily .NET has some really excellent, easy-to-use XML parsing libraries. Why not use them? – Mark Byers Apr 18 '12 at 13:15
  • Yes, I know. I have already been using them. But in this case I really need a regular expression. This is just an example to show what I need; in practice the input is not valid XML. – Mega Apr 18 '12 at 13:31

4 Answers

4

(.*) is greedy (meaning "everything you can match until you find the last </test>"). You're looking for the non-greedy version, (.*?) (meaning "as little as you can match until you find the very first </test>").
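The difference is easy to check on the question's own input. The sketch below uses Python's `re` module purely as an assumption for illustration (the thread is about .NET, whose engine treats these two quantifiers the same way); `TAG`, `OPEN` and `CLOSE` are hypothetical helper names used to build the question's tags and patterns.

```python
import re

TAG = "test"  # tag name from the question
OPEN, CLOSE = f"<{TAG}>", f"</{TAG}>"

# The question's input: three consecutive elements on one line.
text = "".join(f"{OPEN}{s}{CLOSE}" for s in ("SAMPLE TEXT", "SAMPLE TEXT2", "SAMPLE TEXT3"))

# Greedy: .* first runs to the end of the input, then backtracks
# only as far as the LAST closing tag.
greedy = re.search(f"{OPEN}(.*){CLOSE}", text).group(1)

# Lazy: .*? grows one character at a time and stops at the FIRST closing tag.
lazy = re.search(f"{OPEN}(.*?){CLOSE}", text).group(1)

print(greedy)  # everything between the first opening tag and the last closing tag
print(lazy)    # SAMPLE TEXT
```

In .NET, the same capture group can be read with `Regex.Match(input, pattern).Groups[1].Value`.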

Do, however, keep in mind the call of Cthulhu when thinking about parsing HTML with regex, and take a look at this question for a discussion about the best practices for parsing HTML with .NET. Or, if this is XML (not HTML), then by all means do it the proper (and easy) way with an XmlReader.

Community
rid
1

Instead of .* use .*?

The question mark makes the asterisk lazy, causing it to match as little as possible. Without it, the asterisk is greedy and matches as much as it can.

Indrek
1

@Radu's answer is very good, but also consider trying the following:

"<test>([^<]*)</test>"
Dewfy
  • Well, that won't match ``. Then again, XML parsing is full of pitfalls. – rid Apr 18 '12 at 13:20
  • @Radu I fully agree. That is why your answer is better. But this pattern may be very fast when Ljupco_Sofijanov is really sure that only text is possible inside. – Dewfy Apr 18 '12 at 13:22
1

I agree that you could use XML parsing libraries, but I'll reply anyway:

("<test>([^<]*)</test>")

would match only characters other than '<', so the captured group stops at the first '<', which is where the closing tag begins.
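A minimal sketch of how the negated character class behaves, again in Python as an assumption rather than .NET. `TAG`, `OPEN`, `CLOSE` and the `nested` input are hypothetical, introduced only to show both the successful case and the limitation raised in the comments on the other answer.

```python
import re

TAG = "test"  # tag name from the question
OPEN, CLOSE = f"<{TAG}>", f"</{TAG}>"

# [^<]* matches any run of characters other than '<',
# so the capture can never cross a tag boundary.
pattern = re.compile(f"{OPEN}([^<]*){CLOSE}")

text = "".join(f"{OPEN}{s}{CLOSE}" for s in ("SAMPLE TEXT", "SAMPLE TEXT2", "SAMPLE TEXT3"))
first = pattern.search(text).group(1)
print(first)  # SAMPLE TEXT

# Hypothetical input with a nested element: [^<]* stops at the '<' of the
# inner tag, the closing tag is not found there, and the whole match fails.
nested = f"{OPEN}SAMPLE <b>TEXT</b>{CLOSE}"
nested_match = pattern.search(nested)
print(nested_match)  # None
```

No backtracking across tags is needed here, which is why this form can be fast when the content is known to be plain text.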

HTH.

Skippy Fastol