I have a program that creates a docx file, and now I want to write code that takes that docx file and converts it to HTML.
It mostly works, but the converter does not detect headers/titles or lists. I tried adding a check for headers, but the if condition is never met and I'm not sure why.
This is our input, where the first C++ is supposed to be the heading:
C++ <-- header
C++ is a general purpose, high-level programming language developed by Sun Microsystems.
The C++ programming language was developed by a small team of engineers,
known as the Green Team, who initiated the language in 1991.
- Chapter 1
- Chapter 2
- Chapter 3
and here is the converter code:
package lab2.converter;

import java.io.*;
import java.util.*;
import org.apache.poi.xwpf.usermodel.*;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlConverter implements Converter {
    @Override
    public void convert() throws IOException {
        // 1. Read in the Word document using Apache POI
        // (try-with-resources closes the stream and the document even on failure)
        try (FileInputStream fis = new FileInputStream("output/Find&Replace.docx");
             XWPFDocument document = new XWPFDocument(fis)) {
            // 2. Convert the document to HTML using jsoup
            Document htmlDoc = new Document("");
            Element html = htmlDoc.appendElement("html");
            Element body = html.appendElement("body");
            List<XWPFParagraph> paragraphs = document.getParagraphs();
            for (XWPFParagraph paragraph : paragraphs) {
                Element p = body.appendElement("p");
                String text = paragraph.getText();
                // if condition to detect header and list
                // not implemented
                // Add the text content to the paragraph
                p.appendText(text);
            }
            // 3. Write the HTML to a file
            try (PrintWriter out = new PrintWriter("output/Find&Replace.html")) {
                out.println(htmlDoc.html());
            }
            System.out.println("Conversion complete.");
        } catch (Exception e) {
            System.err.println("Error converting Word document to HTML: " + e.getMessage());
        }
    }
}
After running the docx file through the converter, we get this:
<html>
  <body>
    <p>Evaluation Warning: The document was created with Spire.Doc for JAVA.</p>
    <p>Evaluation Warning: The document was created with Spire.Doc for C++.</p>
    <p>C++</p>
    <p>C++ is a general purpose, high-level programming language developed by Sun Microsystems. The C++ programming language was developed by a small team of engineers, known as the Green Team, who initiated the language in 1991.</p>
    <p>Chapter 1</p>
    <p>Chapter 2</p>
    <p>Chapter 3</p>
  </body>
</html>
As I said earlier, I've tried conditions that should be met when a header or a list is detected, but I'm not sure how to detect a header with the libraries I'm using.
The program that creates the docx file uses the com.spire.doc library.
The converter uses org.apache.poi and org.jsoup.nodes. So I'm not sure whether the problem is that I can't mix libraries, or that I don't know how to find headers and lists in the docx file.
` etc tags if the right styles were set in Word
– Gagravarr Feb 28 '23 at 10:59
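Following up on the point about styles: a minimal sketch of how the missing if condition could be built, assuming the docx actually carries Word's built-in styles. The class `StyleTagMapper` and its method `tagFor` are hypothetical names made up for illustration, not part of Apache POI or jsoup. The style IDs `Title` and `Heading1` to `Heading6` are what Word itself usually writes; a file produced by Spire.Doc may use different IDs or none at all, so that part is an assumption to verify by printing `paragraph.getStyleID()` for each paragraph.

```java
import java.util.Locale;

// Hypothetical helper (not part of Apache POI or jsoup): maps a paragraph's
// style ID and list membership to the HTML tag the converter should emit.
final class StyleTagMapper {

    private StyleTagMapper() {}

    /**
     * @param styleId value of XWPFParagraph.getStyleID(); null when no style is set
     * @param inList  true when XWPFParagraph.getNumID() != null (paragraph has numbering)
     * @return "h1".."h6" for heading styles, "li" for list items, otherwise "p"
     */
    static String tagFor(String styleId, boolean inList) {
        if (styleId != null) {
            String id = styleId.toLowerCase(Locale.ROOT);
            // Assumption: the document title should be rendered as a top-level heading.
            if (id.equals("title")) {
                return "h1";
            }
            // Assumption: heading style IDs look like "Heading1" .. "Heading6".
            if (id.startsWith("heading") && id.length() == "heading".length() + 1) {
                char level = id.charAt(id.length() - 1);
                if (level >= '1' && level <= '6') {
                    return "h" + level;
                }
            }
        }
        return inList ? "li" : "p";
    }

    public static void main(String[] args) {
        System.out.println(tagFor("Heading1", false)); // h1
        System.out.println(tagFor(null, true));        // li
        System.out.println(tagFor(null, false));       // p
    }
}
```

Inside the loop over paragraphs in the converter, this could be called as `StyleTagMapper.tagFor(paragraph.getStyleID(), paragraph.getNumID() != null)` and the result passed to `body.appendElement(...)` instead of the fixed `"p"`. List items would additionally need a surrounding `ul` or `ol` element. If `getStyleID()` returns null for the "C++" line, the generating program only made the text look like a heading without assigning a style, which would explain why a style-based condition is never met.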