
Given any string in a tag-based language (like XML), I need to parse it. Tag names can consist of any characters. For example:

String str = "<h1>Some text1</h1>\n" +
             "<jkl><h1>Some text2</h1></jkl>" + 
             "<someTag>Some text3</someTag>";

After parsing it should look like this:

Some text1
Some text2
Some text3
Gooz
Viorel Casapu
  • When you say "like XML" - arbitrary SGML? Some exotic variation of SGML? Why do you have to use regular expressions rather than a proper parser for the language you're interested in? – Jon Skeet Mar 19 '17 at 20:54
  • Just parse by tag name. Why not use an XML parser? – Remario Mar 19 '17 at 21:03
  • Ok, you told us what you need to do. Now, what is your question exactly? How did you try to solve it yourself, and where are you stuck? – Jesper Mar 19 '17 at 21:03
  • As described [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454), you simply can't write a full-scale parser for XML or XML-like languages using RegExp because they're not regular languages. Actually, you can't use RegEx for *any* recursive language because recursive languages aren't regular by definition. It'll only work if you're trying to do some *very* restricted task. I have to agree with @JonSkeet here - why not just go with an appropriate parser? Also, isn't the code sample you give just XML without a root element? – EJoshuaS - Stand with Ukraine Mar 19 '17 at 21:46

3 Answers


It seems like when you say "parse," you really mean "delete."

Try something like:

str.replaceAll("<[^>]*?>", "")

In English:

"<       find an opening <

[^>]*?    followed by any character not a >, zero or more times, reluctantly

>"       followed by a >
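To try this out, here is a minimal, self-contained sketch of the same idea. The class name `TagStripper` and the method `stripTags` are made up for illustration; the pattern is the one from the line above, just precompiled. Because the character class already excludes `>`, the greedy form `<[^>]*>` should match exactly the same text.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
class TagStripper {
    // '<', then any run of characters other than '>' (reluctantly), then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*?>");

    // Removes everything that looks like a tag and keeps the text in between.
    static String stripTags(String input) {
        return TAG.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String str = "<h1>Some text1</h1>\n" +
                     "<jkl><h1>Some text2</h1></jkl>" +
                     "<someTag>Some text3</someTag>";
        System.out.println(stripTags(str));
    }
}
```

Note that this only deletes the tags. On the sample string, "Some text2" and "Some text3" end up on the same line, because the input has a newline only after the first element. It also does not handle a `>` inside an attribute value or a comment, so treat it as a tool for simple input only.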
aghast
  • This answer does not answer the question. He needs the inner group values, and he needs an XML parser for efficiency. – Remario Mar 19 '17 at 21:05
  • Appending `?` means more work for the regex engine. A greedy quantifier works better. – revo Mar 19 '17 at 21:07
  • @CaspainCaldion the desired output, according to the OP, is the same text with no tags. He doesn't need to parse anything, just remove the tags. – aghast Mar 19 '17 at 22:43
  • @Austin Hastings, it seems very clever to use just replaceAll. I hadn't thought of that, and it's easy to understand. It works exactly as I needed. – Viorel Casapu Mar 20 '17 at 12:07

Use Jsoup like this.

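import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String str = "<h1>Some text1</h1>\n" +
    "<jkl><h1>Some text2</h1></jkl>" +
    "<someTag>Some text3</someTag>";
// Parse the string as HTML, then print only the text content.
Document doc = Jsoup.parse(str);
System.out.println(doc.text());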

Output:

Some text1 Some text2Some text3

An XML parser provides a way to access or modify the data in an XML document. Java offers several options for parsing XML. The following parser types are commonly used:

DOM Parser - Parses the document by loading its complete contents and building the full hierarchical tree in memory.

SAX Parser - Parses the document using event-based triggers. Does not load the complete document into memory.

JDOM Parser - Parses the document in a similar fashion to the DOM parser, but with an easier API.

StAX Parser - Parses the document in a similar fashion to the SAX parser, but in a more efficient way.

XPath Parser - Parses the XML based on expressions and is used extensively in conjunction with XSLT.

DOM4J Parser - A Java library to parse XML, XPath and XSLT using the Java Collections Framework; provides support for DOM, SAX and JAXP.
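As a rough illustration of the DOM option, the sketch below uses the JDK's built-in `javax.xml.parsers` API to pull the text out of the question's sample. The class `DomTextExtractor`, the method `extractText`, and the wrapping `<root>` element are my own assumptions: the sample has no single root element, so it has to be wrapped before a strict XML parser will accept it.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class DomTextExtractor {
    // Returns the non-blank text nodes of the fragment, in document order.
    static List<String> extractText(String fragment) {
        // Assumption: wrap the fragment in a made-up <root> so it is well-formed XML.
        String xml = "<root>" + fragment + "</root>";
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> texts = new ArrayList<>();
            collect(doc.getDocumentElement(), texts);
            return texts;
        } catch (Exception e) {
            // Malformed input (for example, unbalanced tags) ends up here.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Depth-first walk that records every text node that is not just whitespace.
    private static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                out.add(text);
            }
            return;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), out);
        }
    }

    public static void main(String[] args) {
        String str = "<h1>Some text1</h1>\n" +
                     "<jkl><h1>Some text2</h1></jkl>" +
                     "<someTag>Some text3</someTag>";
        for (String text : extractText(str)) {
            System.out.println(text);
        }
    }
}
```

For the sample string this yields three separate entries, one per text node. Unlike a regex, it rejects input whose tags are not balanced, which may or may not be what you want for "XML-like" data.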

Please read up on the parsers above. Also, if the input is actually HTML, consider jsoup.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

Remario