XML Splitter using Java - Index is setting offset from the tag omitting the beginning spaces

Question

My source XML file looks like this:

   <?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DocName PUBLIC "-//msg//msg1 Project_Name 1.1//EN" "My_Project_Name_V1_1.dtd">
<My_Project_Name dtdVersion="V1_1" fileName="Guidance_Document_SQL" softwareName="prototype" softwareVersion="0.1" productionDate="2012-01-02">
    <ApplicantFileReference>ABCD#1234</ApplicantFileReference>
    <ApplicantName languageCode="EF">Michael Smith</ApplicantName>
    <ApplicantNameLatin>Michael Smith </ApplicantNameLatin>
    <ProductTitle languageCode="EF">Some Example </InventionTitle>
    <TotalQuantity>88</TotalQuantity>
    <Example_Data exampleIDNumber="1">
        <Exm_Seq>
            <Exm_Seq_length>7</Exm_Seq_length>
            <Exm_Seq_type>MM</Exm_Seq_type>
            <Exm_Seq_div>PAT</Exm_Seq_div>
        <Exm_Seq>
    </Example_Data>

I am splitting this file into two files: a .header file and a .body file. The body file starts at the "Example_Data" tag. The problem is that, when the .body file is created, its content begins right at the tag, without the spaces that precede it in the original file, like this:

<Example_Data exampleIDNumber="1">
            <Exm_Seq>
                <Exm_Seq_length>7</Exm_Seq_length>
                <Exm_Seq_type>MM</Exm_Seq_type>
                <Exm_Seq_div>PAT</Exm_Seq_div>
            <Exm_Seq>
        </Example_Data>

But I want to keep those spaces, so that the content in the body file starts at the same position it has in the original file (after 4 spaces, or however many spaces precede the Example_Data tag). I could hardcode 4 spaces, but that won't help, because other files may have more spaces before this tag.

Here is the piece of code I am working on for the splitting:

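import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;

public class Splitter {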
    public static void main(String[] args) {
        String charset = "UTF-8";
        String original = args[0];
        String stem = original.substring(0, original.length() - 4);
        String headName = stem + ".head";
        String bodyName = stem + ".body";
        String bodyStart ="<Example_Data";
        try {
            //get rid of existing split files
            File existing = new File(headName);
            if(existing.exists()){
                existing.delete();
                System.out.println("Old header File has been deleted");
            }
            existing = new File(bodyName);
            if(existing.exists()){
                existing.delete();
                System.out.println("Old body file has been deleted");
            }
            //read in original file
            StringBuilder fileData = new StringBuilder(1000);
            FileInputStream fis = new FileInputStream(original);
            InputStreamReader fileReader = new InputStreamReader(fis,charset);
            BufferedReader reader = new BufferedReader(fileReader);  
            char[] buf = new char[10];
            System.out.println("Reading xml file");
            int numRead = 0;
            while ((numRead = reader.read(buf)) != -1) {
                String readData = String.valueOf(buf, 0, numRead);
                fileData.append(readData);
                buf = new char[1024];
            }
            reader.close();
            String content = fileData.toString();
            System.out.println("File reading completed");
            //split
            System.out.println("File Splitting process Started");
            int indx = content.indexOf(bodyStart);
            String head = content.substring(0, indx - 1);
            String body = content.substring(indx);
            //write to head file
            OutputStreamWriter headFile = new OutputStreamWriter(new FileOutputStream(headName), charset);
            headFile.write(head);
            System.out.println("New header file created");
            //headFile.flush();
            headFile.close();
            //write body to body file
            OutputStreamWriter bodyFile = new OutputStreamWriter(new FileOutputStream(bodyName), charset);
            bodyFile.write(body);
            System.out.println("New body file created");
            bodyFile.close();
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        } catch (IOException e1) {
            e1.printStackTrace();
        } finally {
            ;
        }
    }
}

I am not sure how to approach this. Any advice will be much appreciated.

You can use a Transformer: https://stackoverflow.com/questions/139076/how-to-pretty-print-xml-from-java — Compass, Apr 25 '18 at 19:40
Key word: the spaces or tabs at the start of the line are called [indentation](https://en.wikipedia.org/wiki/Indentation_(typesetting)#Indentation_in_programming) — Rory O'Kane, Apr 28 '18 at 00:10

Tschallacka · Accepted Answer · 2018-04-26T09:35:07.963

You're probably looking for a small logic check:

If the character before the split point is not the closing bracket of a tag (>), assume the tag starts on a new line.
If a new line is assumed, find the last newline character in head.
Split the body at that newline.
If the newline criteria don't match, split at the found index, because the XML might not contain any newlines at all.

See it online: https://ideone.com/0degt4

The finished code:

String content ="   <?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"+
"<!DOCTYPE DocName PUBLIC \"-//msg//msg1 Project_Name 1.1//EN\" \"My_Project_Name_V1_1.dtd\">\n"+
"<My_Project_Name dtdVersion=\"V1_1\" fileName=\"Guidance_Document_SQL\" softwareName=\"prototype\" softwareVersion=\"0.1\" productionDate=\"2012-01-02\">\n"+
"    <ApplicantFileReference>ABCD#1234</ApplicantFileReference>\n"+
"    <ApplicantName languageCode=\"EF\">Michael Smith</ApplicantName>\n"+
"    <ApplicantNameLatin>Michael Smith </ApplicantNameLatin>\n"+
"    <ProductTitle languageCode=\"EF\">Some Example </InventionTitle>\n"+
"    <TotalQuantity>88</TotalQuantity>\n"+
"    <Example_Data exampleIDNumber=\"1\">\n"+
"        <Exm_Seq>\n"+
"            <Exm_Seq_length>7</Exm_Seq_length>\n"+
"            <Exm_Seq_type>MM</Exm_Seq_type>\n"+
"            <Exm_Seq_div>PAT</Exm_Seq_div>\n"+
"        <Exm_Seq>\n"+
"    </Example_Data>";

// Define newline character to look for. \r \r\n \n         
String newLine = "\n";

// Where the body starts        
String bodyStart ="<Example_Data";

// Base index defined by bodyStart
int indx = content.indexOf(bodyStart);

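// Grab the head: everything before the tag.
// Use indx rather than indx - 1 so a newline directly before the tag is not dropped.
String head = content.substring(0, indx);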

// Find the last index of newline
int lastNewline = head.lastIndexOf(newLine);
String body;
// If we found a newline in head and the character before our match isn't a closing bracket, get content from newline
if(lastNewline != -1 && content.charAt(indx - 1) != '>') {
    body = content.substring(lastNewline + 1);
}
// business as usual
else {
    body = content.substring(indx);
}

System.out.println(body);
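
To use this inside the question's Splitter, the same idea can be wrapped in a helper that returns both halves. The following is only a sketch of a variation, not the code from the ideone link: instead of searching for a newline string, it walks back over spaces and tabs in front of the tag, so it does not need to know whether the file uses \n or \r\n. The class name IndentSplit and the method name split are hypothetical, chosen for illustration.

```java
// Hypothetical helper, not part of the original answer's code.
final class IndentSplit {

    /**
     * Returns {head, body} such that head + body equals content.
     * The body keeps the indentation in front of bodyStart when only spaces
     * or tabs precede the tag on its line; otherwise it starts at the tag.
     */
    static String[] split(String content, String bodyStart) {
        int tagIndex = content.indexOf(bodyStart);
        if (tagIndex < 0) {
            throw new IllegalArgumentException("Marker not found: " + bodyStart);
        }

        // Walk back over the spaces and tabs directly before the tag.
        int cut = tagIndex;
        while (cut > 0 && (content.charAt(cut - 1) == ' ' || content.charAt(cut - 1) == '\t')) {
            cut--;
        }

        // Keep that whitespace only if it begins a line (or the whole content).
        boolean atLineStart = cut == 0
                || content.charAt(cut - 1) == '\n'
                || content.charAt(cut - 1) == '\r';
        if (!atLineStart) {
            cut = tagIndex; // tag sits in the middle of a line: split at the tag
        }

        return new String[] { content.substring(0, cut), content.substring(cut) };
    }

    public static void main(String[] args) {
        String xml = "<root>\n"
                + "    <TotalQuantity>88</TotalQuantity>\n"
                + "    <Example_Data exampleIDNumber=\"1\">\n"
                + "    </Example_Data>";
        String[] parts = split(xml, "<Example_Data");
        System.out.println(parts[1]); // body, including its leading indentation
    }
}
```

In the Splitter from the question, parts[0] and parts[1] would replace the head and body strings before they are written out. Because the two halves always add up to the original content, the head no longer loses its last character the way substring(0, indx - 1) does.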

Excellent answer and explanation. This worked like a charm. I have tested it with other files too, where the beginning tag has a different offset, and it is written to the new file in the proper position. Thanks a lot, mate! — jony70, Apr 26 '18 at 14:10
Tip for the future, write out what you would do if you had to manipulate the file by hand step by step. Then write the code to do that. You're welcome — Tschallacka, Apr 26 '18 at 14:44
