I am having trouble splitting the records of a MARC 21 file. I read from one file, try to put each record on its own line, and write the result to a different file. Here is what I currently have:

import java.io.*;

public class Main {
    public static void main(String[] args) throws IOException{
        FileReader fr = null;
        BufferedReader br = null;
        FileWriter fw = null;
        BufferedWriter bw = null;

        try{
            fr = new FileReader("data.txt");
            br = new BufferedReader(fr);
            fw = new FileWriter("SplitRecords.txt");
            bw = new BufferedWriter(fw);

            String data;
            String recordLength = "";
            int intLength = 0;
            int lengthStart = 0;
            int lengthEnd = 5;

            while((data = br.readLine()) != null){
                while(data != null){
                    recordLength = data.substring(lengthStart, lengthEnd);
                    System.out.println(recordLength);
                    intLength = Integer.parseInt(recordLength);

                    bw.write(data, lengthStart, intLength);
                    bw.write("\n");
                    bw.flush();

                    lengthStart = intLength;
                    lengthEnd = lengthStart + 5;
                    br.mark(intLength);             
                    br.reset();
                }
            }
        }
        finally{
            if(fr != null){
                fr.close();
            }
            if(br != null){
                br.close();
            }
            if(fw != null){
                fw.close();
            }
            if(bw != null){
                bw.close();
            }
        }
    }
}

This is the output and error I am getting:

00934  
00699  
1cRT  
Exception in thread "main" java.lang.NumberFormatException: For input string: "1cRT"  
    at java.lang.NumberFormatException.forInputString(Unknown Source)  
    at java.lang.Integer.parseInt(Unknown Source)  
    at java.lang.Integer.parseInt(Unknown Source)  
    at Main.main(Main.java:26)  

The first and second records are written to the file, but the third iteration does not read the length properly. Does anyone have any idea why this is happening?

h7r

1 Answer

As the System.out.println output shows, the string "1cRT" was read into recordLength, and it cannot be parsed as an integer (or any other numeric value). That is why Integer.parseInt throws the exception.

You should double check if your input data matches the format you are expecting.

EDIT: Looking at the source of your pasted output, one can see that "1cRT", evaluated as a string, contains a Unicode character. I am not familiar with the data format you are expecting, but one plausible explanation is that the record length you read (offsets 0 to 5) is a length in bytes rather than in String characters, whereas String.substring indexes by character (UTF-16 code unit), not by byte. After any multi-byte character, the two offsets no longer agree.
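To illustrate the mismatch, here is a small sketch. The class name CharsVsBytes, the helper utf8Length, and the sample string are made up for this example, and it assumes the data is UTF-8 encoded. A single accented character is enough to make the character count and the byte count differ:

```java
import java.nio.charset.StandardCharsets;

public class CharsVsBytes {
    // Number of bytes the string occupies when encoded as UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // \u00e9 is one char in a Java String but two bytes in UTF-8.
        String text = "caf\u00e9 00699";
        System.out.println("chars: " + text.length());     // 10
        System.out.println("bytes: " + utf8Length(text));  // 11
    }
}
```

A length field counted in bytes would therefore point one position past where substring expects the next record to begin, and the drift grows with every further multi-byte character.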

EDIT 2: The assumption is correct. As per the MARC 21 specification, the record length is encoded as a five-character ASCII numeric string. Therefore, one way of correcting the issue would be replacing

recordLength = data.substring(lengthStart, lengthEnd);

with (untested; needs import java.util.Arrays)

recordLength = new String(Arrays.copyOfRange(data.getBytes(), lengthStart, lengthEnd), "US-ASCII");

Alternatively, you might prefer to refer to this StackOverflow answer about encoding in FileReaders and adjust the reading and writing of the files.
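Reading the file as raw bytes avoids the character/byte mismatch altogether. The following is a minimal sketch, not a tested MARC parser: the class MarcSplitter and the method splitRecords are names invented for illustration, and it assumes the input is a plain concatenation of records whose first five bytes hold the total record length in bytes. The demo in main uses two tiny fake records rather than real MARC data.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class MarcSplitter {
    // Copies each record from in to out, followed by a newline.
    // Returns the number of records copied.
    public static int splitRecords(InputStream in, OutputStream out) throws IOException {
        DataInputStream data = new DataInputStream(in);
        byte[] lengthField = new byte[5];
        int count = 0;
        while (true) {
            int first = data.read();
            if (first == -1) {
                break; // clean end of input between records
            }
            lengthField[0] = (byte) first;
            data.readFully(lengthField, 1, 4); // rest of the 5-digit length
            int recordLength = Integer.parseInt(
                    new String(lengthField, StandardCharsets.US_ASCII));
            if (recordLength < 5) {
                throw new IOException("Invalid record length: " + recordLength);
            }
            // The length includes the 5 bytes already consumed.
            byte[] rest = new byte[recordLength - 5];
            data.readFully(rest);
            out.write(lengthField);
            out.write(rest);
            out.write('\n');
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Two fake 8-byte "records"; the second contains a 2-byte character.
        byte[] input = "00008abc00008\u00e9x".getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int n = splitRecords(new ByteArrayInputStream(input), out);
        System.out.println("records: " + n);
    }
}
```

With real files you would pass new FileInputStream("data.txt") and new FileOutputStream("SplitRecords.txt"), ideally wrapped in buffered streams inside a try-with-resources block, instead of the in-memory streams used here.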

h7r