0

I want to split a single HTML file into multiple HTML files using Java. The file contains several chapters of a textbook, and I want each chapter in its own HTML file. The start of each chapter can be identified by an h2 tag with an id. Below is a sample of the HTML file that I want to split.

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC '-//W3C//DTD XHTML 1.1//EN' 'http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd'>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta name="generator" content="HTML Tidy for Linux (vers 7 December 2008), see www.w3.org"/>
<title>Sample HTML</title>




<link rel="stylesheet" href="0.css" type="text/css"/>
<link rel="stylesheet" href="1.css" type="text/css"/>
<link rel="stylesheet" href="sample.css" type="text/css"/>
<meta name="generator" content="sample content"/>
</head>
<body><div class="c2"><br/>
<br/>
<br/>
<br/></div>
<h2 id="pg00007">Chapter 7</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0008"><!--  H2 anchor --></a></p>
<div class="c2"><br/>
<br/>
<br/>
<br/></div>
<h2 id="pg00008">Chapter 8</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0009"><!--  H2 anchor --></a></p>
<div class="c2"><br/>
<br/>
<br/>
<br/></div>
<h2 id="pg00009">Chapter 9</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0010"><!--  H2 anchor --></a></p>
<div class="c2"><br/>
<br/>
<br/>
<br/></div>
<h2 id="pg00010">Chapter 10</h2>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p>sample paragraph 1</p>
<p><a id="link2HCH0011"><!--  H2 anchor --></a></p>
</body></html>
NagaRajendra
  • java and javascript are very different things – Observer Apr 05 '16 at 19:13
  • Do you need to do this programmatically? It would be much easier to use an HTML editor to do this. – ControlAltDel Apr 05 '16 at 19:17
  • Yes I need to do it programmatically as i will be dealing with some 100 files like this – NagaRajendra Apr 05 '16 at 19:18
  • Use a parser like jsoup. http://stackoverflow.com/a/2170950/971067 – rdonuk Apr 05 '16 at 19:22
  • Note: stack overflow is not a "do my work for me" site. It is a question-and-answer site. People here don't view requests for ready-made solutions very positively. You should do some research, try to solve the problem yourself, and post a question when you run into a problem. – RealSkeptic Apr 05 '16 at 19:31
  • @RealSkeptic I do know that it's not a freelancing site. I'm trying to get some help and suggestions, like the comment above that mentioned jsoup, which I'm looking into. If you know of any tools or plugins related to this, please let me know. I don't think I bugged you specifically, but if I did, I'm sorry for wasting your time. – NagaRajendra Apr 05 '16 at 19:40

2 Answers

1

I'm not entirely sure whether this would work, but I think you could use a parser like jsoup (http://jsoup.org/) as follows:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements chapters = doc.select("h2"); 

You then have to extract the content for each chapter and persist it as a new HTML file (including body, etc.).
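To make that extraction step concrete, here is a minimal sketch of grouping the elements that sit between consecutive h2 headings. It deliberately uses the JDK's built-in DOM parser instead of jsoup so that it runs without extra dependencies, which only works if the input is well-formed XML without a DOCTYPE or HTML entities; for real-world HTML, jsoup is the more tolerant choice. The class name `ChapterGrouper` and the method `groupByH2` are made up for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: groups body children by the h2 that precedes them.
final class ChapterGrouper {

    /** Maps each h2 id to the text of the elements between that h2 and the next one. */
    static Map<String, List<String>> groupByH2(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            Node body = doc.getElementsByTagName("body").item(0);
            Map<String, List<String>> chapters = new LinkedHashMap<>();
            List<String> current = null; // stays null until the first h2 is seen
            for (Node n = body.getFirstChild(); n != null; n = n.getNextSibling()) {
                if (n.getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace text nodes and comments
                }
                Element e = (Element) n;
                if ("h2".equals(e.getTagName())) {
                    // A new heading starts a new chapter bucket.
                    current = new ArrayList<>();
                    chapters.put(e.getAttribute("id"), current);
                } else if (current != null) {
                    current.add(e.getTextContent());
                }
            }
            return chapters;
        } catch (Exception ex) {
            throw new IllegalStateException("Could not parse input as XML", ex);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input, shortened from the question's HTML.
        String sample = "<html><body><div><br/></div>"
                + "<h2 id=\"pg00007\">Chapter 7</h2><p>one</p><p>two</p>"
                + "<h2 id=\"pg00008\">Chapter 8</h2><p>three</p>"
                + "</body></html>";
        System.out.println(groupByH2(sample));
    }
}
```

Anything that appears before the first h2 is skipped. In a real splitter you would keep the elements themselves rather than their text, so that each group can be written out as markup.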

uniknow
  • Thanks, i will try it – NagaRajendra Apr 05 '16 at 21:09
  • This looks like a solution, but select("h2") returns only the h2 elements themselves, so the output for each Element is just Chapter 7, Chapter 8, etc. Is there a way to at least extract the content between each h2, so that I can write that content to a different HTML file? I appreciate your help. – NagaRajendra Apr 05 '16 at 21:38
  • You could do something like described in [jsoup how to get all html between 2 header tags](http://stackoverflow.com/questions/6534456/jsoup-how-to-get-all-html-between-2-header-tags) – uniknow Apr 06 '16 at 07:18
  • That looks right, thanks. I'm trying to achieve almost the same thing. Thanks again – NagaRajendra Apr 06 '16 at 14:17
0

I was finally able to do it. Here is the solution that splits the HTML as described in the question:

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class App {

    public static void JsoupReader() {
        File input = new File("src/resources/sample_book.htm.html");
        try {
            Document doc = Jsoup.parse(input, "UTF-8");
            Element head = doc.select("head").first();
            Element firstH2 = doc.select("h2").first();
            if (firstH2 == null) {
                return; // no chapters, nothing to split
            }
            // text() rather than html(), so markup inside the heading never ends up in the file name
            String h2Text = firstH2.text();
            List<Element> elementsBetween = new ArrayList<Element>();
            // Walk the elements that follow the first h2; each further h2 closes the current chapter.
            for (Element sibling = firstH2.nextElementSibling(); sibling != null;
                    sibling = sibling.nextElementSibling()) {
                if (!"h2".equals(sibling.tagName())) {
                    elementsBetween.add(sibling);
                } else {
                    processElementsBetween(h2Text, head, elementsBetween);
                    elementsBetween.clear();
                    h2Text = sibling.text();
                }
            }

            // Write the last chapter, which is not followed by another h2.
            if (!elementsBetween.isEmpty()) {
                processElementsBetween(h2Text, head, elementsBetween);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Writes one chapter to src/resources/<heading text>.html, reusing the original head.
    // The h2 heading itself is not written, only the elements that follow it.
    private static void processElementsBetween(String h2Text, Element head,
            List<Element> elementsBetween) throws IOException {

        File newHtmlFile = new File("src/resources/" + h2Text + ".html");
        StringBuilder htmlString = new StringBuilder();
        htmlString.append("<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\">");
        htmlString.append(head.outerHtml());
        htmlString.append("<body>"
                + "<div class=\"c2\">"
                + "<br/>"
                + "<br/>"
                + "<br/>"
                + "<br/>"
                + "</div>");
        for (Element element : elementsBetween) {
            htmlString.append(element.outerHtml());
        }
        htmlString.append("</body></html>");
        FileUtils.writeStringToFile(newHtmlFile, htmlString.toString(), "UTF-8");
    }
}
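One thing to watch in the code above is that the output file name is built directly from the heading text. That is fine for "Chapter 7", but a heading containing characters such as `/` or `:` would produce an invalid or surprising path. The following is a rough sketch of one way to derive a safer name; `ChapterFileNames` and `toFileName` are hypothetical names introduced here and are not part of jsoup or Commons IO.

```java
import java.util.Locale;

// Hypothetical helper for turning heading text into a file name.
final class ChapterFileNames {

    /** Lower-cases the heading and replaces every run of non-alphanumeric characters with a dash. */
    static String toFileName(String headingText) {
        String slug = headingText.trim()
                .toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9]+", "-")   // collapse anything unsafe into one dash
                .replaceAll("^-+|-+$", "");      // drop leading and trailing dashes
        // Fall back to a generic name when nothing usable is left.
        return (slug.isEmpty() ? "chapter" : slug) + ".html";
    }

    public static void main(String[] args) {
        System.out.println(toFileName("Chapter 7"));
    }
}
```

Two headings that differ only in punctuation would map to the same name with this approach, so the h2 id (for example pg00007) may be a better basis for the file name if the ids are unique.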

Thanks to uniknow for your help and to RealSkeptic for your criticism.

Community
NagaRajendra