So I have a program that creates a docx file, and now I want to write code that takes that docx file and converts it to HTML.

It mostly works, but the converter does not detect headers/titles or lists. I've tried to make it detect headers, but the if condition is never met and I'm not sure why.

This is our input, where the first C++ is supposed to be the heading:

C++ <-- header
C++ is a general purpose, high-level programming language developed by Sun Microsystems. 
The C++ programming language was developed by a small team of engineers, 
known as the Green Team, who initiated the language in 1991.
- Chapter 1
- Chapter 2
- Chapter 3

and here is the converter code:

package lab2.converter;

import java.io.*;
import java.util.*;
import org.apache.poi.xwpf.usermodel.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlConverter implements Converter{

    @Override
    public void convert() throws IOException {
        // 1. Read in the Word document using Apache POI.
        //    try-with-resources closes the stream and the document even if conversion fails.
        try (FileInputStream fis = new FileInputStream("output/Find&Replace.docx");
             XWPFDocument document = new XWPFDocument(fis)) {

            // 2. Build the HTML tree using jsoup
            Document htmlDoc = new Document("");
            Element html = htmlDoc.appendElement("html");
            Element body = html.appendElement("body");

            List<XWPFParagraph> paragraphs = document.getParagraphs();
            for (XWPFParagraph paragraph : paragraphs) {
                // Every paragraph currently becomes a <p>, whatever its Word style
                Element p = body.appendElement("p");
                String text = paragraph.getText();

                //if condition to detect header and list
                //not implemented

                // Add the text content to the paragraph
                p.appendText(text);
            }

            // 3. Write the HTML to a file
            try (PrintWriter out = new PrintWriter("output/Find&Replace.html")) {
                out.println(htmlDoc.html());
            }

            System.out.println("Conversion complete.");

        } catch (Exception e) {
            System.err.println("Error converting Word document to HTML: " + e.getMessage());
        }

    }
}

After running the docx file through the converter we get this:

<html>
 <body>
  <p>Evaluation Warning: The document was created with Spire.Doc for JAVA.</p>
  <p>Evaluation Warning: The document was created with Spire.Doc for C++.</p>
  <p>C++</p>
  <p>C++ is a general purpose, high-level programming language developed by Sun Microsystems. The C++ programming language was developed by a small team of engineers, known as the Green Team, who initiated the language in 1991.</p>
  <p>Chapter 1</p>
  <p>Chapter 2</p>
  <p>Chapter 3</p>
 </body>
</html>

As I said earlier, I've tried conditions that should be met when a header or a list is detected, but I'm not sure how to detect a header with the libraries I'm using.

The program that creates the docx file uses the com.spire.doc library.

The converter uses org.apache.poi and org.jsoup.nodes. So I'm not sure whether the problem is that I can't mix libraries, or that I don't know how to find headers and lists in the docx file.
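To make the missing if condition concrete, here is a rough sketch of the kind of check I have in mind, written in plain Java so it runs without POI. It assumes that each paragraph's style ID (which in POI would come from paragraph.getStyleID()) and whether it has numbering (paragraph.getNumID() != null) are already known. The names ParagraphTags, Para, tagFor and render are made up for illustration, and the style IDs "Title" and "Heading1" to "Heading6" are an assumption about what Word usually writes; the IDs that Spire.Doc actually produces may differ and would need to be printed out and checked.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of the converter above. */
final class ParagraphTags {

    /** Stand-in for the values that would be read from each XWPFParagraph. */
    static final class Para {
        final String styleId;   // assumed to come from paragraph.getStyleID(); may be null
        final boolean numbered; // assumed to be paragraph.getNumID() != null
        final String text;

        Para(String styleId, boolean numbered, String text) {
            this.styleId = styleId;
            this.numbered = numbered;
            this.text = text;
        }
    }

    /** Picks an HTML tag. Style IDs "Title" and "Heading1".."Heading6" are assumptions. */
    static String tagFor(String styleId, boolean numbered) {
        if (styleId != null) {
            if (styleId.equals("Title")) {
                return "h1";
            }
            if (styleId.matches("Heading[1-6]")) {
                // Reuse the heading level digit: "Heading2" -> "h2"
                return "h" + styleId.charAt(styleId.length() - 1);
            }
        }
        // Headings are checked first because a heading can also carry numbering
        return numbered ? "li" : "p";
    }

    /** Renders paragraphs, wrapping each run of consecutive list items in one <ul>. */
    static String render(List<Para> paras) {
        StringBuilder sb = new StringBuilder();
        boolean inList = false;
        for (Para p : paras) {
            String tag = tagFor(p.styleId, p.numbered);
            boolean isItem = tag.equals("li");
            if (isItem && !inList) {
                sb.append("<ul>");
            } else if (!isItem && inList) {
                sb.append("</ul>");
            }
            inList = isItem;
            // Text is not HTML-escaped here; jsoup's appendText would do that in the real converter
            sb.append('<').append(tag).append('>')
              .append(p.text)
              .append("</").append(tag).append('>');
        }
        if (inList) {
            sb.append("</ul>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Para> paras = new ArrayList<>();
        paras.add(new Para("Heading1", false, "C++"));
        paras.add(new Para(null, false, "Intro text"));
        paras.add(new Para(null, true, "Chapter 1"));
        paras.add(new Para(null, true, "Chapter 2"));
        System.out.println(render(paras));
    }
}
```

In the converter itself the same decision would replace the hard-coded body.appendElement("p"). If the style ID of the first "C++" paragraph comes back as null or as something other than a heading ID, that would explain why my if condition is never met.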

    "So I have a program that can create a docx file..." So, make a copy of that program and modify it to create an HTML file. – Gilbert Le Blanc Feb 27 '23 at 22:42
  • I don't want to just create an HTML file. I want to read a docx file and convert it to HTML from there. If I copied the other program I wouldn't really be converting anything, I'd just be creating the HTML file. – apo Feb 28 '23 at 09:17
  • Did you try with Apache Tika? Tika will output `<h1>` etc tags if the right styles were set in Word – Gagravarr Feb 28 '23 at 10:59
