I'm using Regex to find the content between specific HTML tags.

However, the content may be in a foreign language and can include absolutely anything.

I am trying to figure out a regex where I can capture absolutely everything between tags. I've seen articles and Q/As on specific cases but I can't figure out how to put them all together (especially the foreign character requirement).

Does anyone have any solutions/ideas?

CodyBugstein

3 Answers

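<.*?> lazily matches a single tag, whatever it contains. To grab the content between two tags instead, put a lazy group between them, for example <p>(.*?)</p>, and enable dot-matches-newline mode if the content can span several lines.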
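A minimal Java sketch of capturing the text between one specific pair of tags with a lazy group. The class and method names (TagContentDemo, between) are made up for illustration, and the sketch assumes the tag is never nested inside itself. Foreign characters need no special handling because . matches any character; Pattern.DOTALL extends that to line breaks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagContentDemo {
    // Returns the text inside every <tag>...</tag> pair found in html.
    // Hypothetical helper for illustration; it does not handle nested tags.
    static List<String> between(String tag, String html) {
        String t = Pattern.quote(tag);
        // Opening tag (with optional attributes), lazy capture, closing tag.
        Pattern p = Pattern.compile(
                "<" + t + "(?:\\s[^>]*)?>(.*?)</" + t + ">",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Non-ASCII sample text is written with Unicode escapes so the
        // source file compiles under any default encoding.
        String html = "<p>\u3053\u3093\u306b\u3061\u306f</p>"
                + "<p class=\"x\">Gr\u00fc\u00dfe,\nWelt</p>";
        System.out.println(between("p", html));
    }
}
```

The attribute part (?:\s[^>]*)? keeps a search for p from matching a pre tag by accident.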

Voidpaw

You can use the following pattern to match any character except the less-than sign: [^<]

With a quantifier ([^<]* or [^<]+) it matches foreign characters and line breaks, and it stops just before the first character of the next tag.
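As a rough illustration of the negated character class in Java: the sketch below adds a + quantifier and a leading > so that it picks up the first run of text after a tag. The names NotLessThanDemo and firstText are hypothetical, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NotLessThanDemo {
    // Returns the first non-empty run of text that follows a '>' and
    // contains no '<', or null when there is no such run.
    static String firstText(String html) {
        Matcher m = Pattern.compile(">([^<]+)").matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Cyrillic sample text, written with Unicode escapes.
        System.out.println(firstText("<b>\u041f\u0440\u0438\u0432\u0435\u0442</b>"));
        // A negated class also matches line breaks, so no extra flag is needed.
        System.out.println(firstText("<i>line one\nline two</i>"));
    }
}
```

Unlike the dot, [^<] matches line breaks without any flag, so multi-line content is captured as-is.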

David Pärsson

Solution

>.*?<

Caveat

Regex is a poor fit for parsing HTML. For example, if a literal '<' appears in the text between two tags, this regex stops at that character and returns truncated content.
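The failure mode is easy to reproduce. The sketch below wraps the lazy part of the pattern in a capturing group (an addition for illustration, as are the names CaveatDemo and lazyBetween) and runs it on text that contains a stray '<'.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CaveatDemo {
    // Applies >(.*?)< and returns the first captured run, or null if none.
    static String lazyBetween(String html) {
        Matcher m = Pattern.compile(">(.*?)<", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Well-formed input: the whole text between the tags is captured.
        System.out.println("[" + lazyBetween("<p>abc</p>") + "]");
        // A literal '<' inside the text ends the match early.
        System.out.println("[" + lazyBetween("<p>1 < 2</p>") + "]");
    }
}
```

For the second input the capture ends at the stray '<', so only "1 " comes back instead of "1 < 2".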

Please consider using something like jsoup; it's a small Java library that handles HTML parsing very well.

Josh T