I need to parse the body of the HTML below and produce the output shown further down.

The output may contain only the {p, i, b, br} tags. All other tags must be removed, leaving only their text in the output.

This is my input:

<!DOCTYPE HTML>
<html>
    <head>
        <title>Introduction</title>
    </head>
    <body>
        <article id="mobi_content">
            <h1 class="mobi-page-title">Introduction</h1>
            <section id="dataSectionInstanceId-431331" class="body-text">This book is about creating a great career. <p>You might be saying to yourself, "I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!" <p>Well, if you're looking, we're going to show you how to get that great job now. That's the first, short-term step. <p>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world. <p>This book is about today and tomorrow. It's about getting a great job now and enjoying a great career for life. <p>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: "I can't believe I get paid for doing this!" Are only a few people entitled to feel that way, but not the rest of us? <p>And what about you? Are you looking forward to a great career? Would you describe your current career as "great"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know? <p>Furthermore, just how do you create a great career for yourself? <p>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. Whether you're looking for a job or want to make the job you have more meaningful, this book is for you.
            </section>
        </article>
    </body>
</html>

Expected output:

This book is about creating a great career.
<P>You might be saying to yourself, "I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!" 
<P>Well, if you're looking, we're going to show you how to get that great job now. That's the first, short-term step. 
<P>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world. 
<P>This book is about today and tomorrow. It's about getting a great job now and enjoying a great career for life. 
<P>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: "I can't believe I get paid for doing this!" Are only a few people entitled to feel that way, but not the rest of us? 
<P>And what about you? Are you looking forward to a great career? Would you describe your current career as "great"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know? 
<P>Furthermore, just how do you create a great career for yourself? 
<P>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. Whether you're looking for a job or want to make the job you have more meaningful, this book is for you.

My code:

doc.body().traverse(new NodeVisitor() {

    @Override
    public void head(Node node, int depth) {

        String name = node.nodeName();
        String paraText = "";

        if (node instanceof TextNode) {

            TextNode tn = ((TextNode) node);

            if (node.nodeName().equals("p")) {
                //finalHtml+="<p>"+tn.text()+"</p>";
            } else {
                finalHtml += tn.text();
            }

        } else if (node instanceof Node) {

            if (node.nodeName() == "p") {
                System.out.println("fnbdnv"+node.toString());
            }
            if (node.nodeName() == "h1") {
                // finalHtml+="<p>"+node.toString()+"<p>";
            } else if (node.nodeName() == "div") {
                node.removeAttr("class");
                finalHtml += node.toString();
            } else if (node.nodeName() == "seection") {
                    finalHtml += node.toString();
            } else if (node.nodeName() == "<b>") {
                finalHtml += node.toString();
            } else if (node.nodeName() == "<i>") {
                finalHtml += "<i>" + node.toString() + "</i>";
            }
        }

    }

    @Override
    public void tail(Node node, int depth) {
        // Do Nothing
    }
});
Mark

1 Answer

A regex may be a better fit in this case.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) {
        try {
            String html = "<!DOCTYPE HTML>" +
                            "<html>" +
                                "<head>" +
                                    "<title>Introduction</title>" +
                                "</head>" +
                                "<body>" +
                                    "<article id=\"mobi_content\">" +
                                        "<h1 class=\"mobi-page-title\">Introduction</h1>" +
                                        "<section id=\"dataSectionInstanceId-431331\" class=\"body-text\">This <i>book</i> is about creating a great career. <p>You might be saying to yourself, \"I don't want to talk about a career, much less a great career. Right now I just need a job. I need to eat!\" <p>Well, if you're looking, we're going to show you how to get that great job now. That's the first, short-term step. <p>But the day will come when you'll want to do more than just eat. And beyond that day will come another day when you look back at your life and take measure of your entire professional contribution to the world. <p>This book is about today and tomorrow. It's about getting a great job now and enjoying a great career for life. <p>When we say a person has had a great career, what do we mean? That he or she made a lot of money? moved spectacularly up the corporate ladder? became famous or renowned in his or her profession? What about the familiar comment from every movie star on every talk show: \"I can't believe I get paid for doing this!\" Are only a few people entitled to feel that way, but not the rest of us? <p>And what about you? Are you looking forward to a great career? Would you describe your current career as \"great\"? When you get to the end of your productive life, will you be looking back on a mediocre career? a good career? a great career? And how will you know? <p>Furthermore, just how do you create a great career for yourself? <p>As coauthors of this book, we are fascinated by these provocative questions. We have been associated in our work for many years as avid students of what it takes to build a great life and career. And we bring two different sets of experiences to the issue, so occasionally, we will speak to you directly in our own voices. We'll share with you our discoveries and provide tools and insights that will help you find answers for yourself. Whether you're looking for a job or want to make the job you have more meaningful, this book is for you." +
                                        "</section>" +
                                    "</article>" +
                                "</body>" + 
                                "</html>";

            Document doc = Jsoup.parse(html);


            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static String removeTags(String source) {    
        return source.replaceAll("(?!(</?p>|</?i>|</?b>|<br/?>))(</?.*?>)", " ");
    }
}
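
For a quick check without jsoup, the whitelist idea can be sketched with java.util.regex alone. The class name TagWhitelist, the method strip and the pattern itself are illustrative assumptions, not the code above. This sketch replaces a stripped tag with nothing rather than a space, and it will misfire on a literal < in text or a > inside an attribute value.

```java
import java.util.regex.Pattern;

public class TagWhitelist {

    // Matches any tag whose name is not p, i, b or br (hypothetical pattern).
    // The \b stops "b" from whitelisting <body> and "p" from whitelisting <pre>.
    private static final Pattern DISALLOWED =
            Pattern.compile("<(?!/?(?:p|i|b|br)\\b)[^>]*>");

    // Removes every non-whitelisted tag and keeps the text between tags.
    public static String strip(String html) {
        return DISALLOWED.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<section class=\"body-text\">This <i>book</i> is about a career. "
                + "<p>You need a <span>job</span>.<br/></section>";
        // prints: This <i>book</i> is about a career. <p>You need a job.<br/>
        System.out.println(strip(html));
    }
}
```

Whitelisted tags keep their attributes here, so <p class="x"> passes through unchanged.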

Update

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class Main {

    public static void main(String[] args) {
        try {
            String html = "<!DOCTYPE HTML>" +
                            "<html>" +
                                "<head>" +
                                    "<title>Introduction</title>" +
                                "</head>" +
                                "<body> <article id=\"mobi_content\"> <h1 class=\"mobi-page-title\">\"Build Your Village\" Tool</h1> <section id=\"dataSectionInstanceId-431408\" class=\"body-text\"><p class=\"nonindent\">Your great career depends not only on you,</p> <p class=\"nonindent\">Sample deposits in the Emotional Bank Account:</p> <ul class=\"bullet\"> <li><p class=\"nonindent\">Congratulate the person on a job well done.</p></li> <li><p class=\"nonindent\">Send birthday greetings.</p></li></section></article></body>" +
                                "</html>";

            Document doc = Jsoup.parse(html);


            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static String removeTags(String source) {    
        return source.replaceAll("(?!(</p>|<p .*?>|</?i>|</?b>|<br/?>))(</?.*?>)", " ");
    }
}

Update 2

import java.util.ListIterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Attributes;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Main {

    public static void main(String[] args) {
        try {
            Pattern pattern = Pattern.compile("/(((?!/).)*)[.]");

            String html = "<!DOCTYPE HTML>" +
                            "<html>" +
                                "<head>" +
                                    "<title>Introduction</title>" +
                                "</head>" +
                                "<body> <article id=\"mobi_content\"> <h1 class=\"mobi-page-title\">\"Build Your Village\" Tool</h1> <section id=\"dataSectionInstanceId-431408\" class=\"body-text\"><p class=\"nonindent\">Your great career depends not only on you,</p> <p class=\"center\"><img src=\"mpla/multimedia/Cove_9781936111107_epub_005_r1.png\" id=\"mobi_image_12776\" class=\"inline-img\" alt=\"PNG\"/></p><p class=\"nonindent\">Sample deposits in the Emotional Bank Account:</p> <ul class=\"bullet\"> <li><p class=\"nonindent\">Congratulate the person on a job well done.</p></li> <li><p class=\"nonindent\">Send birthday greetings.</p></li></section></article></body>" +
                                "</html>";

            Document doc = Jsoup.parse(html);
            Elements imgs = doc.select("img");
            System.out.println(imgs);
            ListIterator<Element> iter = imgs.listIterator();
            while(iter.hasNext()) {
                Element img = iter.next();
                String src = img.attr("src");     
                Matcher matcher = pattern.matcher(src);
                if (matcher.find()) {
                    img.tagName("graphic").text(matcher.group(1)); 
                    removeAttr(img);
                }         
            }

            System.out.println(removeTags(doc.body().toString()));

        } catch (Exception e) {
            e.printStackTrace();
        }
    }

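    public static void removeAttr(Element e) {
        // Iterate over a copy (asList()) so that removing attributes
        // does not modify the collection being iterated.
        for (Attribute a : e.attributes().asList()) {
            e.removeAttr(a.getKey());
        }
    }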

    public static String removeTags(String source) {    
        return source.replaceAll("(?!(</p>|<p .*?>|</?graphic>|</?i>|</?b>|<br/?>))(</?.*?>)", " ").trim();
    }
}
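
The pattern used in Update 2 to pull the file name out of the src attribute can be tried on its own with only the standard library. This is a minimal sketch: the class name SrcBaseName and the helper baseName are hypothetical, and only the pattern string comes from the code above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SrcBaseName {

    // Same pattern as in Update 2: a '/', then a captured run of
    // non-'/' characters, then a '.'.
    private static final Pattern BASE_NAME = Pattern.compile("/(((?!/).)*)[.]");

    // Returns the captured name, or empty when the path has no "/name." part.
    public static Optional<String> baseName(String src) {
        Matcher m = BASE_NAME.matcher(src);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints: Optional[Cove_9781936111107_epub_005_r1]
        System.out.println(baseName("mpla/multimedia/Cove_9781936111107_epub_005_r1.png"));
        // prints: Optional.empty (there is no '/' before the file name)
        System.out.println(baseName("cover.png"));
    }
}
```

Two limits are worth knowing: a src without any slash does not match at all, and a dot inside a directory name (for example a/b.c/d.png) makes find() stop at that directory instead of the file name.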
Alkis Kalogeris
  • Hi Alkis, thanks for the great response. But if I use the .toString() method, special characters come out like {>,"}. Is there any solution for this other than the replace method? – rajyalakshmi Sep 04 '14 at 07:30
  • That didn't happen to me. Did you change anything? – Alkis Kalogeris Sep 04 '14 at 08:45
  • I didn't change any code. But when I try another text that contains HTML quotes, they are displayed like that. – rajyalakshmi Sep 04 '14 at 10:32
  • Please post that html so I can test it. – Alkis Kalogeris Sep 04 '14 at 10:34
  • "Build Your Village" Tool

    p class="nonindent">Your great career depends not only on you,

    Sample deposits in the Emotional Bank Account:

    • Congratulate the person on a job well done.

    • Send birthday greetings.

    – rajyalakshmi Sep 04 '14 at 11:18
  • `p class="nonindent">` You were missing the starting `<`. Furthermore, I've improved the regex so it can handle p tags with classes and/or ids – Alkis Kalogeris Sep 04 '14 at 12:17
  • Hi Alkis, is it possible to parse node by node from the start to the end of the HTML? I actually have to parse several different HTML formats, so parsing each node would help a lot. Can you help me with that? – rajyalakshmi Sep 05 '14 at 04:35
  • I don't understand your question. Could you be more specific? If this is another question, please consider opening a new thread. – Alkis Kalogeris Sep 05 '14 at 11:30
  • Can you help me out here? Input:

    PNG

    Expected output:

    Cove_9781936111107_epub_005_r1

    – rajyalakshmi Sep 08 '14 at 06:45
  • It is displayed like this:

    Congratulate the person on a job well done.

    Send birthday greetings.

    But I need to remove the class="nonindent" attribute. I tried removing the attribute, but that removes the whole opening tag.
    – rajyalakshmi Sep 09 '14 at 09:58
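
One possible way to drop attributes such as class="nonindent" while keeping the tag itself is a second regex pass over the already filtered string. This is only a sketch under assumptions: the class name AttributeStripper, the method stripAttributes and the pattern are hypothetical, and the pattern does not cope with a > inside an attribute value.

```java
import java.util.regex.Pattern;

public class AttributeStripper {

    // An opening p/i/b/br tag, any attributes, and an optional self-closing slash.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<(p|i|b|br)\\b[^>]*?(/?)>");

    // Rewrites each matched tag as just its name (plus the slash, if present).
    public static String stripAttributes(String html) {
        return OPEN_TAG.matcher(html).replaceAll("<$1$2>");
    }

    public static void main(String[] args) {
        // prints: <p>Send birthday greetings.</p>
        System.out.println(stripAttributes("<p class=\"nonindent\">Send birthday greetings.</p>"));
    }
}
```

Closing tags such as </p> are left alone because the pattern only matches a < followed directly by the tag name.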