I have a method in which a .txt file is parsed with Scanner, reassembled with DocumentBuilder, and written out as an .xml file via TransformerFactory.

Everything works fine, except for one small inconvenience: the file created this way has what I assume to be a BOM at the beginning of its name. I'm encoding in UTF-8.

It's saved under %EF%BB%BFexample.xml instead of example.xml.

How can I avoid that?

EDIT: As the comments show, I was pointed to the possibility that the first line, fileTitle, which Scanner reads from userText, contains the UTF-8 BOM. That turned out to be true (again, see the comments).

private void writeXML() {
    try {
        File userText = new File(passedPath);

        Scanner scn = new Scanner(new FileInputStream(userText), "UTF-8");

        String separate = ";";
        String fileTitle = scn.nextLine();
        int indSepTitle = fileTitle.indexOf(separate);
        fileTitle = fileTitle.substring(0,indSepTitle);

        String fileOutputName = fileTitle+".xml";
        File mOutFile = new File(getFilesDir(), fileOutputName);

        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();

        //root element
        Document doc = docBuilder.newDocument();
        Element rootElement = doc.createElement("Collection");
        doc.appendChild(rootElement);

        //List element
        Element listElement = doc.createElement("List");
        rootElement.appendChild(listElement);

        //set Attributes to listElement
        Attr attr = doc.createAttribute("name");
        attr.setValue(fileTitle);
        listElement.setAttributeNode(attr);

        while (scn.hasNextLine()) {
            String line = scn.nextLine();
            String[] parts = line.split(separate);

            //vocabulary element
            Element ringElement = doc.createElement("element_ring");
            listElement.appendChild(ringElement);

            //add 1st Element
            Element n1Element = doc.createElement("element1");
            n1Element.appendChild(doc.createTextNode(parts[0]));
            ringElement.appendChild(n1Element);

            //add 2nd Element
            Element n2Element = doc.createElement("element2");
            n2Element.appendChild(doc.createTextNode(parts[1]));
            ringElement.appendChild(n2Element);

            ...
            //add other Elements accordingly
            ...
        }

        //write the content into xml file
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");
        DOMSource source = new DOMSource(doc);
        StreamResult result = new StreamResult(mOutFile);

        transformer.transform(source, result);


    } catch (ParserConfigurationException e) {
        e.printStackTrace();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (TransformerConfigurationException e) {
        e.printStackTrace();
    } catch (TransformerException e) {
        e.printStackTrace();
    }

}
Schelmuffsky
  • I suspect the BOM is already present in your `userText` file, and also returned from the `Scanner`. See https://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java if that is indeed the case. – Thomas Jun 15 '18 at 11:30
  • `userText` is in `UTF-8`, if that's what you mean. The problem from your link sounds similar, but the first line that is retrieved from it via `Scanner` is saved in a `String`, which doesn't contain said BOM; I looked at it with the debugger. How can that be? – Schelmuffsky Jun 15 '18 at 11:36
  • How did you verify that it doesn't contain the BOM? Note that if you just print or log it, it may be invisible. Try printing or logging `fileTitle.codePointAt(0)`, which should give 0xfeff = 65279 if this is a BOM. [Edit: or asking your debugger to evaluate that expression.] – Thomas Jun 15 '18 at 11:38
  • You are right: `fileTitle.codePointAt(0)` returns 65279 and `fileTitle.codePointAt(1)` returns 79, which is the actual first letter from `userText`. – Schelmuffsky Jun 15 '18 at 11:52
  • @Thomas: You should add an answer based on your comment. – kjhughes Jun 15 '18 at 12:03
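
The check suggested in the comments can be reproduced in isolation. The following is a minimal sketch, not the code from the question: the class name `BomCheck`, its helper methods, and the sample content "Opel;cars" are made up for illustration. It assumes the input starts with the UTF-8 BOM bytes EF BB BF, which Java's UTF-8 decoder passes through as the single char U+FEFF instead of dropping it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class BomCheck {

    /** Returns true if the text starts with U+FEFF, the char a UTF-8 BOM decodes to. */
    public static boolean startsWithBom(String text) {
        return !text.isEmpty() && text.charAt(0) == '\uFEFF';
    }

    /** Reads the first line of the given bytes, decoding them as UTF-8 like the question does. */
    public static String firstLine(byte[] utf8Bytes) {
        try (Scanner scn = new Scanner(new ByteArrayInputStream(utf8Bytes), "UTF-8")) {
            return scn.nextLine();
        }
    }

    public static void main(String[] args) {
        // Hypothetical file content: encoding U+FEFF as UTF-8 yields the BOM bytes EF BB BF.
        byte[] fileBytes = "\uFEFFOpel;cars\n".getBytes(StandardCharsets.UTF_8);

        String fileTitle = firstLine(fileBytes);

        // The BOM is invisible when printed, so inspect the code points instead.
        System.out.println(startsWithBom(fileTitle));  // true
        System.out.println(fileTitle.codePointAt(0));  // 65279 (0xFEFF)
        System.out.println(fileTitle.codePointAt(1));  // 79 ('O')
    }
}
```

Because the BOM has no visible glyph, a debugger's string view or a log line can look clean even though the first char is U+FEFF, which is why comparing code points is the more reliable test.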

1 Answer

For the sake of completeness:

I added the following short snippet to remove the BOM from the String that is extracted to serve as the name of the .xml file being created.

// Drop the first char (the BOM, U+FEFF) and keep the rest of the title.
// Note: this removes the first char unconditionally, so it assumes the BOM is present.
String cutTitle = fileTitle.substring(1);

String fileOutputName = cutTitle + ".xml";
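
If the input file might also arrive without a BOM, dropping the first char unconditionally would cut off a real letter. A slightly more defensive variant could look like the sketch below. The class name `BomStrip` and the helper `stripBom` are hypothetical and not part of the original code; the sketch only assumes that the BOM, when present, shows up as a single leading U+FEFF char, as confirmed in the comments.

```java
public class BomStrip {

    // U+FEFF is what the UTF-8 BOM bytes EF BB BF decode to.
    private static final char BOM = '\uFEFF';

    /** Returns the text without a leading BOM; text without a BOM is returned unchanged. */
    public static String stripBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // Hypothetical title as Scanner would return it from a file saved with a BOM.
        String fileTitle = "\uFEFFexample";

        String fileOutputName = stripBom(fileTitle) + ".xml";
        System.out.println(fileOutputName); // prints "example.xml"
    }
}
```

In the method from the question, the call would go right after `scn.nextLine()`, so that both the file name and the `name` attribute are built from the cleaned title.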
Schelmuffsky