parsing string to get content

Question

I have the following HTML string:

<h3>I only want this content</h3> I don't want this content <b>random content</b>

I would like to get only the content inside the h3 tags and discard everything else. I have the following:

String getArticleBody = listArt.getChildText("body");
StringBuilder mainArticle = new StringBuilder();
String getSubHeadlineFromArticle;

if(getArticleBody.startsWith("<h3>") && getArticleBody.endsWith("</h3>")){
    mainArticle.append(getSubHeadlineFromArticle);
 }

But this returns the whole content, which is not what I am after. If someone could help me, that would be great. Thanks.

See: http://stackoverflow.com/questions/16597303/extract-string-between-two-strings-in-java — Bartek Maraszek, Jul 18 '14 at 11:05

score 1 · Answer 1 · edited Jul 18 '14 at 11:36


Thanks, guys. All your answers worked, but I ended up using Jsoup.

String getArticleBody = listArt.getChildText("body");
org.jsoup.nodes.Document docc = Jsoup.parse(getArticleBody);
// first() returns null when the document contains no h3 element
org.jsoup.nodes.Element h3Tag = docc.getElementsByTag("h3").first();
String getSubHeadlineFromArticle = (h3Tag != null) ? h3Tag.text() : "";

edited Jul 18 '14 at 11:36 by octothorpentine
answered Jul 18 '14 at 11:12 by Limpep

score 0 · Answer 2 · answered Jul 18 '14 at 11:06


You can use the substring method like this:

String a="<h3>I only want this content</h3> I don't want this content <b>random content</b>";
System.out.println(a.substring(a.indexOf("<h3>")+4,a.indexOf("</h3>")));

Output:

I only want this content
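
The one-liner above assumes both tags are present. When one is missing, indexOf returns -1 and substring either throws StringIndexOutOfBoundsException or returns the wrong text. A minimal sketch of a guarded version follows; the class name SubstringBetween and the helper between are hypothetical names chosen for illustration, not part of the answer.

```java
class SubstringBetween {
    /**
     * Returns the text between the first occurrence of open and the next
     * occurrence of close after it, or null if either marker is missing.
     */
    static String between(String source, String open, String close) {
        int start = source.indexOf(open);
        if (start < 0) {
            return null; // opening marker not found
        }
        start += open.length(); // skip past the marker itself
        int end = source.indexOf(close, start);
        if (end < 0) {
            return null; // no closing marker after the opening one
        }
        return source.substring(start, end);
    }

    public static void main(String[] args) {
        String a = "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
        System.out.println(between(a, "<h3>", "</h3>")); // I only want this content
    }
}
```

Returning null for "not found" is one possible design choice; an empty string or an Optional would work just as well, depending on what the caller expects.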

answered Jul 18 '14 at 11:06 by Ninad Pingale

score 0 · Answer 3 · answered Jul 18 '14 at 11:07


Try this:

String result = getArticleBody.substring(getArticleBody.indexOf("<h3>"), getArticleBody.indexOf("</h3>"))
                .replaceFirst("<h3>", "");
System.out.println(result);

answered Jul 18 '14 at 11:07 by Wundwin Born

ferrerverck · Answer 4 · 2014-07-18T11:23:17.453

You can use a regex like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String str = "<h3>asdfsdafsdaf</h3>dsdafsdfsafsadfa<h3>second</h3>";
        // the pattern captures whatever sits between <h3> and </h3>
        // the ? makes .+ reluctant (non-greedy), so each match stops at the nearest closing tag
        Pattern pattern = Pattern.compile("<h3>(.+?)</h3>");
        Matcher matcher = pattern.matcher(str);
        while (matcher.find()) {
            System.out.println(matcher.group(1));
        }
    }
}

matcher.group(1) returns exactly the text between the h3 tags.
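
If the headings may carry attributes, span several lines, or use uppercase tags, the pattern can be loosened a little. The sketch below is an assumption about what such input might look like, not something the question states; H3Regex and allH3 are hypothetical names. It also uses .*? rather than .+?, so an empty heading yields an empty string instead of being skipped. A regex is still not a full HTML parser, so nested or malformed markup can defeat it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class H3Regex {
    // <h3\b[^>]*> tolerates attributes on the opening tag.
    // DOTALL lets '.' match line breaks; CASE_INSENSITIVE accepts <H3> as well.
    private static final Pattern H3 = Pattern.compile(
            "<h3\\b[^>]*>(.*?)</h3>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Collects the trimmed content of every h3 element, in document order. */
    static List<String> allH3(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = H3.matcher(html);
        while (m.find()) {
            result.add(m.group(1).trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String str = "<h3>asdfsdafsdaf</h3>dsdafsdfsafsadfa<h3>second</h3>";
        System.out.println(allH3(str)); // [asdfsdafsdaf, second]
    }
}
```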

score 0 · Answer 5 · answered Jul 18 '14 at 11:08

Using a regular expression may help:

String str = "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
final Pattern pattern = Pattern.compile("<h3>(.+?)</h3>");
final Matcher matcher = pattern.matcher(str);
// find() must return true before group(1) is read; otherwise group(1) throws IllegalStateException
if (matcher.find()) {
    System.out.println(matcher.group(1)); // prints the text inside the h3 tags
}

Output:

I only want this content

score 0 · Answer 6 · answered Jul 18 '14 at 11:25

The other answers already cover how to get the result you want. I'm gonna comment your code to explain why it isn't doing that already. (Note that I modified your variable names because strings don't get anything; they are a thing.)

// declare a bunch of variables
String articleBody = listArt.getChildText("body");
StringBuilder mainArticle = new StringBuilder();
String subHeadlineFromArticle; // declared, but never assigned a value

// check to see if the article body consists entirely of a subheadline
if(articleBody.startsWith("<h3>") && articleBody.endsWith("</h3>")){
    // if it does, append subHeadlineFromArticle, which still holds nothing
    // (as a local variable this does not even compile; as a field it would be null)
    mainArticle.append(subHeadlineFromArticle);
}
// if it doesn't, don't do anything

// final result:
//   articleBody = the entire article body
//   mainArticle = no extracted content (nothing was ever taken out of articleBody)
//   subHeadlineFromArticle = never assigned
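
For comparison, here is a minimal sketch of the same flow with the extraction step actually performed before the append. It takes the body as a plain String parameter in place of listArt.getChildText("body"), and the names SubHeadlineDemo and buildMainArticle are hypothetical.

```java
class SubHeadlineDemo {
    /** Returns the content of the first h3 element, or an empty string if there is none. */
    static String buildMainArticle(String articleBody) {
        StringBuilder mainArticle = new StringBuilder();

        // locate the tags anywhere in the body instead of requiring
        // the body to start and end with them
        int start = articleBody.indexOf("<h3>");
        int end = (start < 0) ? -1 : articleBody.indexOf("</h3>", start);

        if (start >= 0 && end >= 0) {
            // assign the sub-headline first, then append it
            String subHeadlineFromArticle =
                    articleBody.substring(start + "<h3>".length(), end);
            mainArticle.append(subHeadlineFromArticle);
        }
        return mainArticle.toString();
    }

    public static void main(String[] args) {
        String body = "<h3>I only want this content</h3> I don't want this content <b>random content</b>";
        System.out.println(buildMainArticle(body)); // I only want this content
    }
}
```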
