
I'm trying to create a simple HTML-to-Markdown converter in Java. I found the answer "html to md", but it seems to be quite outdated and no longer works because of the stack trace below. Is there any way to convert HTML to Markdown in 2018 with any of the JVM-based languages?

Both files (HTML and XSL) are encoded as UTF-8 and don't contain any unusual characters.

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

Here is the code I'm tuning:

public static void main(String[] args) throws TransformerException {
    final String md = convert(htmlLocation);
}

public static String convert(final String htmlLocation) throws TransformerException {

    if (htmlLocation == null) {
        return "";
    }

    final File xslFile = new File(xslLocation);
    final Source htmlSource = new StreamSource(new StringReader(htmlLocation));
    final Source xslSource = new StreamSource(xslFile);

    final TransformerFactory transformerFactory = TransformerFactory.newInstance();
    final Transformer transformer = transformerFactory.newTransformer(xslSource);

    final StringWriter result = new StringWriter();
    transformer.transform(htmlSource, new StreamResult(result));

    return result.toString();
}

Content of the HTML file:

<html>
    <h1>Lorem ipsum dolor</h1>
    <h2>Lorem ipsum dolor</h2>
    <p>Lorem ipsum dolor</p>
</html>

For anyone who is struggling with the same issue, please refer to this project, which does the conversion without XSLT:

https://github.com/pnikosis/jHTML2Md

hdmiimdh
  • Is the snippet you have shown the complete content of the HTML document you are trying to feed to the XSLT stylesheet? Because XSLT by default processes a well-formed XML document and that HTML fragment certainly is not a well-formed XML document. – Martin Honnen Dec 23 '18 at 14:43
  • It doesn't change the issue that I'm getting, even if you wrap those lines in – hdmiimdh Dec 23 '18 at 14:45
  • You are talking about an HTML file and your variable is named htmlLocation but your code using a StringReader over the htmlLocation variable would only work if the variable contains the HTML contents and not the file. So it is not clear what you are actually doing. If you have an XHTML file then use `htmlSource = new StreamSource("foo.xhtml")`. – Martin Honnen Dec 23 '18 at 15:00
  • Also, that stylesheet http://www.lowerelement.com/Geekery/XML/markdown.xsl expects real XHTML with elements being in the XHTML namespace (`http://www.w3.org/1999/xhtml`), not elements in no namespace. – Martin Honnen Dec 23 '18 at 15:02
  • @MartinHonnen it doesn't work either way, just tried with valid xhtml by putting the content in the body – hdmiimdh Dec 23 '18 at 15:08
  • I am not sure the general task of converting HTML to Markdown is solved well by grabbing some stylesheet written in 2004, but that stylesheet, when applied to simple XHTML 1 in the online example http://xsltransform.hikmatu.com/94hvTyP, seems to produce some conversion. As for your Java code, it is still not clear whether your variable `htmlLocation` contains the XHTML content you are trying to convert or the file path or URL of the XHTML document. – Martin Honnen Dec 23 '18 at 16:07
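
The distinction raised in the comments above (file location versus document content, and XHTML namespace versus no namespace) can be sketched roughly as follows. This is only an illustration under assumptions: the class name `XhtmlSourceSketch`, the tiny stylesheet in `XSL`, and the `SAMPLE` document are made up for the example and stand in for the real `markdown.xsl` and input file. If a file path is wrapped in a `StringReader`, the parser sees the path text itself (for example a leading `/`) instead of `<`, which would be one plausible way to get "Content is not allowed in prolog" at line 1, column 1.

```java
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XhtmlSourceSketch {

    // Hypothetical stand-in for markdown.xsl: prints the first XHTML h1 as a Markdown heading.
    // Note the h: prefix bound to the XHTML namespace; elements in no namespace would not match.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\""
        + " xmlns:h=\"http://www.w3.org/1999/xhtml\">"
        + "<xsl:output method=\"text\"/>"
        + "<xsl:template match=\"/\">"
        + "<xsl:text># </xsl:text><xsl:value-of select=\"//h:h1\"/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Hypothetical well-formed XHTML input with the namespace declared on the root element.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<h1>Lorem ipsum dolor</h1><p>Lorem ipsum dolor</p>"
        + "</body></html>";

    // Runs the stylesheet over whatever Source is supplied.
    static String transform(Source xhtml) throws TransformerException {
        Source xsl = new StreamSource(new StringReader(XSL));
        Transformer transformer = TransformerFactory.newInstance().newTransformer(xsl);
        StringWriter result = new StringWriter();
        transformer.transform(xhtml, new StreamResult(result));
        return result.toString();
    }

    // Use this when the argument is a file location: the parser opens and reads the file.
    static String convertFile(String location) throws TransformerException {
        return transform(new StreamSource(new File(location)));
    }

    // Use this when the argument already holds the XHTML markup itself.
    static String convertContent(String xhtml) throws TransformerException {
        return transform(new StreamSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws IOException, TransformerException {
        Path file = Files.createTempFile("sample", ".xhtml");
        Files.write(file, SAMPLE.getBytes(StandardCharsets.UTF_8));

        System.out.println(convertFile(file.toString()));
        System.out.println(convertContent(SAMPLE));

        try {
            // Mixing the two up: the path string is parsed as if it were XML.
            convertContent(file.toString());
        } catch (TransformerException e) {
            System.out.println("Parsing the path as XML failed: " + e.getMessage());
        }
    }
}
```

In the question's `convert` method, the equivalent change would presumably be to build `htmlSource` from `new File(htmlLocation)` rather than from a `StringReader`, provided `htmlLocation` really is a path.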

1 Answer

org.xml.sax.SAXParseException; 
lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.

This can be caused by a hidden character at the start of the file. Possibly the file you are trying to convert begins with a UTF-8 BOM (Byte Order Mark). You can convert such a file to UTF-8 without a BOM and then run your program. On a Mac, you can use this command to remove the BOM.
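
If the content is read into a `String` before parsing, the BOM could also be dropped in Java itself. The following is a minimal sketch under that assumption. The class `BomSketch` and its methods are hypothetical names for illustration, not part of any library. When XML is handed to the parser through a `Reader`, a leading U+FEFF character is not treated as an encoding marker, so it counts as content before the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class BomSketch {

    // Removes a single leading U+FEFF character, if present; otherwise returns the text unchanged.
    static String stripBom(String text) {
        if (text != null && text.startsWith("\uFEFF")) {
            return text.substring(1);
        }
        return text;
    }

    // Parses XML held in a String using the JDK's default DOM parser.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String withBom = "\uFEFF<html><body><h1>Lorem ipsum dolor</h1></body></html>";

        try {
            parse(withBom);
            System.out.println("Parsed with BOM");
        } catch (Exception e) {
            System.out.println("With BOM: " + e.getMessage());
        }

        Document doc = parse(stripBom(withBom));
        System.out.println("Without BOM, root element: " + doc.getDocumentElement().getNodeName());
    }
}
```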

akash
  • That doesn't help much; I get the same issue. – hdmiimdh Dec 23 '18 at 13:21
  • @UladzislauKuzmin Did you convert your file to UTF-8? If yes, how? You need to fix the encoding of the file you are parsing. – akash Dec 23 '18 at 13:22
  • Sure, I've changed the encoding via Sublime (I'm a Mac user), and it says that the format of the file is UTF-8. If I run `file markdown.xsl` in the terminal, it gives me the following result: `markdown.xsl: exported SGML document text, ASCII text, with very long lines`, which is basically the same. – hdmiimdh Dec 23 '18 at 13:24
  • Can you try to remove BOM with [this](https://unix.stackexchange.com/questions/381230/how-can-i-remove-the-bom-from-a-utf-8-file)? – akash Dec 23 '18 at 13:27
  • Same result. Can you try to run the code sample? I'll update the question in a minute. – hdmiimdh Dec 23 '18 at 13:36
  • @UladzislauKuzmin Can you share your XSLT file? Possibly the problem is in your XSLT file... – akash Dec 24 '18 at 02:23