I have some data in an XML file and I am using the Processing library to parse that file. I ran into the BOM (byte order mark) issue, which caused some errors to be thrown. I found a workaround elsewhere, but it is very slow: I'm using Apache Commons IO's BOMInputStream to read the file one byte at a time, after skipping the bytes that make up the BOM.

I think that the source of my problem is actually my lack of knowledge about streams, readers and writers. There are so many different readers and writers and all kinds of "streams" (a word I barely understand) that I want to pull my hair out trying to figure out which one to use and how. I think I just picked the wrong implementation.

Question: Can someone show me why my code is so slow, and also help me improve my understanding of file I/O?

Code:

private static XML noBOM(String filename, PApplet p) throws FileNotFoundException, IOException{

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    File f = new File(filename);
    InputStream stream = new FileInputStream(f);
    BOMInputStream bomIn = new BOMInputStream(stream);

    int tmp = -1;
    while ((tmp = bomIn.read()) != -1){
        out.write(tmp);
    }

    String strXml = out.toString();
    return p.parseXML(strXml);
}

public static Map<String, Float> lifeExpectancyFromXML(String filename, PApplet p, 
        int year) throws FileNotFoundException, IOException{


    Map<String, Float> dataMap = new HashMap<>();

    XML xml = noBOM(filename, p);

    if(xml != null){

        XML[] records = xml.getChild("data").getChildren("record");

        for (XML record : records){
            XML[] fields = record.getChildren("field");

            String country = fields[0].getContent();
            int entryYear = fields[2].getIntContent();
            float lifeEx = fields[3].getFloatContent();

            if (entryYear == year){
                System.out.println("Country: " + country);
                System.out.println("Life Expectancy: " + lifeEx);
                dataMap.put(country, lifeEx);
            }
        }
    } 
    else {
        System.out.println("String could not be parsed.");
    }

    return dataMap;
} 
rocksNwaves

2 Answers


The problem is probably that the InputStream is read one byte at a time. Try using a buffer to improve performance:

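// `out` is the ByteArrayOutputStream from the question's code
try (BOMInputStream bis = new BOMInputStream(new FileInputStream(new File(filename)))) {
    byte[] buffer = new byte[1000];
    int n;
    // read(buffer) returns how many bytes it actually placed in the buffer
    while ((n = bis.read(buffer)) != -1) {
        // write only those n bytes; the rest of the array may hold stale data
        out.write(buffer, 0, n);
    }
}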

Updated:

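The resulting ByteArrayOutputStream can end with extra bytes if the whole buffer is written after a short final read. If those extra bytes are only empty (zero) bytes or whitespace, trimming the decoded string removes them. Leftover non-whitespace bytes from an earlier block are not removed this way, so it is safer to write only the number of bytes that read() returned. To trim the resulting string: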

out.toString("UTF-8").trim()
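
To see what the buffer is doing, and why writing the whole array can leave junk at the end, here is a small self-contained sketch. It uses a ByteArrayInputStream in place of the file and a deliberately tiny 4-byte buffer so the effect is visible. The class and method names (BufferCopyDemo, copyWholeBuffer, copyBytesRead) are made up for illustration and are not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class BufferCopyDemo {

    // Pitfall: ignores the count returned by read() and writes the full array,
    // so a short final read leaves bytes from the previous block in the output.
    static String copyWholeBuffer(InputStream in, int bufferSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize];
        while (in.read(buffer) != -1) {
            out.write(buffer);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Writes only the n bytes that read() reported for this block.
    static String copyBytesRead(InputStream in, int bufferSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[bufferSize];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] xml = "<a><b/></a>".getBytes(StandardCharsets.UTF_8); // 11 bytes

        // Blocks read: "<a><", "b/><", "/a>" (only 3 bytes in the last one)
        System.out.println(copyWholeBuffer(new ByteArrayInputStream(xml), 4)); // <a><b/></a><
        System.out.println(copyBytesRead(new ByteArrayInputStream(xml), 4));   // <a><b/></a>
    }
}
```

The stray character in the first result is a `<` left over from the second block, not whitespace, so trim() would not remove it. That would be consistent with a parser complaining about content in the trailing section.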
Daniil
  • This method seems to leave some trailing data that is considered illegal. I now get the error `org.xml.sax.SAXParseException; Content is not allowed in trailing section`. – rocksNwaves Feb 11 '20 at 16:28
  • Updated answer with an example of removing trailing characters – Daniil Feb 11 '20 at 16:34
  • I'm afraid the error persists, even after your update. Also, would you mind explaining what the buffer is doing exactly? – rocksNwaves Feb 11 '20 at 16:46
  • With a buffer, data is read from the InputStream and written to the OutputStream in blocks of up to the buffer's size, rather than one byte at a time. That should speed up reading and writing. – Daniil Feb 11 '20 at 20:05
  • Do you have an example of the resulting string (written to a log or taken from the debugger) after all that processing? Does it have leading or trailing whitespace or other characters? – Daniil Feb 11 '20 at 20:08

My solution was to use BufferedReader instead of creating my own buffer. It made everything quite speedy:

private static XML noBOM(String path, PApplet p) throws IOException {

    // Fall back to UTF-8 when the file has no BOM
    String defaultEncoding = "UTF-8";

    // BOMInputStream strips a leading Byte Order Mark
    // (with this constructor it only looks for the UTF-8 BOM).
    // try-with-resources closes the stream even if reading fails.
    try (BOMInputStream bomIn = new BOMInputStream(new FileInputStream(path))) {

        // If a BOM is present, use the encoding it indicates. If not, use UTF-8
        ByteOrderMark bom = bomIn.getBOM();
        String charSet = bom == null ? defaultEncoding : bom.getCharsetName();

        // BufferedReader pulls data from the file in large blocks, so the
        // per-character read() calls below are served from memory
        InputStreamReader reader = new InputStreamReader(bomIn, charSet);
        BufferedReader breader = new BufferedReader(reader);

        // Build the string to parse into XML using Processing's PApplet.parseXML
        StringBuilder buildXML = new StringBuilder();
        int c;
        while ((c = breader.read()) != -1) {
            buildXML.append((char) c);
        }
        return p.parseXML(buildXML.toString());
    }
}
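
On the more general "which stream or reader do I use" part, the way I think about it now is as layers: an InputStream hands out raw bytes, an InputStreamReader decodes those bytes into characters using a charset, and a BufferedReader sits on top so that small reads are served from memory instead of hitting the file each time. Below is a minimal sketch of that layering using only the JDK. It is not the Commons IO approach above: it assumes the input is UTF-8 and checks for the BOM by hand, and the names ReaderLayersDemo and readUtf8SkippingBom are just for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReaderLayersDemo {

    // bytes (InputStream) -> chars (InputStreamReader) -> buffered chars (BufferedReader)
    static String readUtf8SkippingBom(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {

            // Java's UTF-8 decoder passes a BOM through as the character U+FEFF.
            // Peek at the first character and put it back unless it is the BOM.
            reader.mark(1);
            if (reader.read() != '\uFEFF') {
                reader.reset();
            }

            // Read in chunks of characters rather than one at a time.
            StringBuilder sb = new StringBuilder();
            char[] chunk = new char[4096];
            int n;
            while ((n = reader.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 BOM (EF BB BF) followed by "<a/>"
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'};
        System.out.println(readUtf8SkippingBom(new ByteArrayInputStream(withBom))); // <a/>
    }
}
```

For a real file, the ByteArrayInputStream would be replaced with a FileInputStream; the layers on top stay the same.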
rocksNwaves