Regex for finding everything possible (including foreign characters) between two HTML tags

Question

I'm using Regex to find the content between specific HTML tags.

However, the content may be in a foreign language and can include absolutely anything.

I am trying to figure out a regex where I can capture absolutely everything between tags. I've seen articles and Q/As on specific cases but I can't figure out how to put them all together (especially the foreign character requirement).

Does anyone have any solutions/ideas?

Could you please provide some of the `html` you're talking about, or the regex you have tried so far? – Savv, Oct 30 '13 at 16:09

score 0 · Answer 1 · answered Oct 30 '13 at 16:09

<.*?> should grab anything regardless of what it could be.

Voidpaw

Would not that match _inside_ a tag rather than _between_ tags? – David Pärsson Oct 30 '13 at 16:10
Well yeah, if you mean "between" as in `>sometext<`, then use the brackets in the reverse order. – Voidpaw Oct 30 '13 at 16:20

score 0 · Answer 2 · answered Oct 30 '13 at 16:10

You can use the following pattern to match any character except the less-than sign: [^<]

Because it is a negated character class, it matches foreign characters too, but it stops before the '<' that opens the next tag.
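As a rough illustration of this answer, here is a minimal Java sketch that repeats the negated class with `+` and wraps it in a capture group between '>' and '<'. The class name `NegatedClassDemo`, the method `textRuns`, and the sample markup are made up for the example, and Java is assumed only because another answer in this thread mentions a Java library. The non-ASCII sample text is written with Unicode escapes so the source file stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NegatedClassDemo {
    // [^<]+ means "one or more characters that are not '<'".
    // The capture group holds the text between a '>' and the next '<'.
    private static final Pattern TEXT_BETWEEN_TAGS = Pattern.compile(">([^<]+)<");

    /** Collects every non-empty run of text found between '>' and '<'. */
    static List<String> textRuns(String html) {
        List<String> runs = new ArrayList<>();
        Matcher m = TEXT_BETWEEN_TAGS.matcher(html);
        while (m.find()) {
            runs.add(m.group(1));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Hypothetical input: German and Chinese text in <p>, two lines in <b>.
        String html = "<p>Gr\u00fc\u00dfe, \u4e16\u754c</p><b>first\nsecond</b>";
        for (String run : textRuns(html)) {
            System.out.println("[" + run + "]");
        }
    }
}
```

A negated class also matches line breaks, so text spread over several lines is captured without any extra flag. Empty elements produce no match, because `+` requires at least one character.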

David Pärsson

score 0 · Answer 3 · answered Oct 30 '13 at 21:55

Solution

>.*?<

Caveat

Regex is a poor fit for parsing HTML. Consider, for example, a stray '<' in the text between two HTML tags: the pattern would stop at it and return only part of the content.
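To make the caveat concrete, here is a small Java sketch of the pattern from this answer. The class name `LazyDotDemo`, the helper `firstBetween`, and the sample strings are invented for illustration. It also assumes `Pattern.DOTALL`, which the answer does not mention, because by default '.' does not match line terminators in Java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LazyDotDemo {
    // DOTALL lets '.' match line terminators as well; without it,
    // content that spans several lines would not be matched.
    private static final Pattern BETWEEN = Pattern.compile(">(.*?)<", Pattern.DOTALL);

    /** Returns the first run of text between '>' and the next '<', or null if there is none. */
    static String firstBetween(String html) {
        Matcher m = BETWEEN.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Well-behaved input: the whole two-line content is captured.
        System.out.println("[" + firstBetween("<p>line one\nline two</p>") + "]");
        // A stray '<' inside the text cuts the capture short.
        System.out.println("[" + firstBetween("<p>1 < 2 is true</p>") + "]");
    }
}
```

For the second input the capture ends at the stray '<', so only `1 ` (with its trailing space) comes back rather than the full sentence.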

Please consider using something like jsoup; it is a small library for Java that works miracles on HTML parsing.
