I'm trying to write a program that filters data. The input file contains 27,000 lines and is over 150 MB. No matter how I implement the function, it stops printing prematurely around line 4,300. I've tested the loop without printing data (just printing the line number) and it reaches the full 27,000 lines. I think this might be a memory issue, but since I'm so new to Java, I'm not sure where the problem is. The two main suspects right now are line.substring and the PrintStream class. Please help!

public static void main(String[] args) {
  // tries to print output to output.csv in same directory
  try {
     PrintStream out = new PrintStream(new FileOutputStream("output.csv"));
     System.setOut(out);
  }
  catch(IOException e1) {
    System.out.println("Error during reading/writing");
  }

  // read input file
  File inputFile = new File("my-large-file.txt");

  if(!inputFile.canRead()) {
     System.out.println("Required input file not found; exiting.");
     System.exit(1);
  }

  // doesn't allow me to use scanner without try for some reason
  try {
     Scanner input = new Scanner(inputFile);

     while (input.hasNextLine()) {
        String line = input.nextLine();

        // scan through each line
        Scanner lineScan = new Scanner(line);

        // if we find the line that we want to look through
        if(lineScan.next().startsWith("1")) {

           // prints the specific data to output
           String a= line.substring(007, 666);         
           if (!(a== "the-number-that-I-don't-want")) {
              String current         = line.substring(1, 10);
              String another         = line.substring(10, 20).replaceAll("\\s+","");
              String third           = line.substring(20, 30).replaceAll("\\s +","");
              String fourth          = line.substring(40, 50);
              ...
              String nth             = line.substring(999, 1000);


              System.out.print(current + ", ");
              System.out.print(another + ", ");
              System.out.print(third + ", ");
              System.out.print(fourth + ", ");
              ...
              System.out.print(nth);
              System.out.println();

           }
        }
     }
   }
  catch(IOException e) {
     e.printStackTrace();
  } 

}

Alex Mac
  • What parameters are you using for the heap size when you run the program? You may have to up the memory size for this to run. See http://stackoverflow.com/questions/1565388/increase-heap-size-in-java – ManoDestra Apr 25 '16 at 16:48
  • Also, you should write this line as follows: `!("the-number-that-I-don't-want".equals(a))` – ManoDestra Apr 25 '16 at 16:51
  • @ManoDestra, I don't know how to set heap size and haven't heard about it before. I am looking it up right now, but I'm assuming it's whatever the default size is in jGrasp? Perhaps. And thank you for your suggestion! I'll update my code – Alex Mac Apr 25 '16 at 16:54
  • This answer specifically should help you here: http://stackoverflow.com/a/15517399/5969411 – ManoDestra Apr 25 '16 at 16:55
  • Ah, I see. You're creating a Scanner and then another Scanner to read each line. You would need to dispose of that second Scanner before you loop again, for sure, to release resources. I would recommend using a CSV library such as [OpenCSV](https://sourceforge.net/projects/opencsv/) for this kind of thing. – ManoDestra Apr 25 '16 at 16:58
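
To illustrate the `equals` suggestion from the comments, here is a minimal sketch of why `==` and `equals` can disagree for a string produced by `substring`. The class name `StringCompareDemo`, the helper `keep`, and the sample values are made up for illustration and are not part of the original program.

```java
class StringCompareDemo {
    // Returns true when the field differs from the unwanted value, i.e. the row should be kept.
    // Calling equals on the constant avoids a NullPointerException if field is null.
    static boolean keep(String field, String unwanted) {
        return !unwanted.equals(field);
    }

    public static void main(String[] args) {
        String line = "xx12345yy";               // hypothetical sample record
        String a = line.substring(2, 7);         // a new String object holding "12345"

        System.out.println(a == "12345");        // compares references: false
        System.out.println(a.equals("12345"));   // compares contents: true
        System.out.println(keep(a, "12345"));    // false, so this row would be skipped
    }
}
```

With `==`, the original condition `!(a == "...")` is true for every substring, so no row is ever filtered out.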

2 Answers

String.substring needs valid indices: if the line is shorter than the end index, it throws a StringIndexOutOfBoundsException. Also, strings must be compared with equals, not ==.

  if (line.length() >= 666) { // Or even 1000
      String a = line.substring(7, 666); // 007 is an octal literal; it has the same value as 7
      if (!a.equals("the-number-that-I-don't-want")) {
          ...
      }
  }

And then you should close everything you opened: lineScan and especially input.
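
One way to make sure both scanners are closed is try-with-resources (Java 7 and later). The sketch below reads from an in-memory string rather than a file so that it stays self-contained; the class name `CloseScanners`, the method `countMatching`, and the sample text are assumptions for illustration only.

```java
import java.util.Scanner;

class CloseScanners {
    // Counts lines whose first token starts with "1".
    // Both scanners are closed automatically when their try blocks end.
    static int countMatching(String text) {
        int count = 0;
        try (Scanner input = new Scanner(text)) {
            while (input.hasNextLine()) {
                String line = input.nextLine();
                try (Scanner lineScan = new Scanner(line)) {
                    // hasNext() guards against blank lines, where next() would throw
                    if (lineScan.hasNext() && lineScan.next().startsWith("1")) {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countMatching("1 abc\n\n2 def\n100 ghi\n")); // prints 2
    }
}
```

The `hasNext()` check is worth noting: in the question's loop, a blank line would make `lineScan.next()` throw a NoSuchElementException.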

In this case, a BufferedReader might be more intuitive than a Scanner, which splits its input into tokens. BufferedReader is simpler, and likely faster.
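
As a rough sketch of the BufferedReader approach, assuming a made-up record layout (lines starting with "1", two fixed-width fields in columns 1-9 and 10-19), something like the following processes one line at a time and writes to the output file directly instead of redirecting System.out. The class name `FilterLines` and the method `filter` are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class FilterLines {
    // Copies selected fields of matching lines from in to out; returns the number of rows written.
    static int filter(Path in, Path out) throws IOException {
        int written = 0;
        // try-with-resources closes (and flushes) both files, even if an exception is thrown
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // skip lines that are too short for the substring calls below, or that do not match
                if (line.length() < 20 || !line.startsWith("1")) {
                    continue;
                }
                String first = line.substring(1, 10).trim();   // assumed field position
                String second = line.substring(10, 20).trim(); // assumed field position
                writer.write(first + ", " + second);
                writer.newLine();
                written++;
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: java FilterLines <input> <output>");
            return;
        }
        int rows = filter(Paths.get(args[0]), Paths.get(args[1]));
        System.err.println("Wrote " + rows + " rows");
    }
}
```

Only the current line is held in memory, and the length check keeps a short line from aborting the run partway through the file.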

Joop Eggen
  • I've been giving BufferedReader a try, but it hasn't solved it so far. I have fixed the .equals as per your and @ManoDestra's suggestion, thanks! I didn't include info about the line length, but it's a fixed length every time. I'll keep hacking and let you know if BufferedReader solves it – Alex Mac Apr 26 '16 at 14:48

I was able to figure it out! Thank you guys for pointing me in the right direction.

The problem with my program was that I was storing too much in memory: each line of the file, another Scanner to scan through each line, the extracted strings, the concatenated strings, and so on.

I used StringBuffer instead of String because of its performance gains when doing concatenations.
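
For what it's worth, the concatenation benefit comes from appending all fields into a single builder per row; wrapping each substring in its own buffer does not by itself avoid the concatenation in the final print. A minimal sketch of the single-builder idea, using StringBuilder (the unsynchronized counterpart of StringBuffer), with a hypothetical helper `toCsvRow` and a made-up sample record:

```java
class RowBuilder {
    // Joins already-extracted fields into one CSV row using a single builder.
    static String toCsvRow(String... fields) {
        StringBuilder row = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                row.append(", ");
            }
            row.append(fields[i]);
        }
        return row.toString();
    }

    public static void main(String[] args) {
        String line = "0ABCDEFGHI  12 34   rest-of-the-record"; // hypothetical sample record
        String current = line.substring(1, 10);
        String another = line.substring(10, 20).replaceAll("\\s+", "");
        System.out.println(toCsvRow(current, another)); // prints "ABCDEFGHI, 1234"
    }
}
```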

Here is my revised solution, which now runs through the whole file and filters as intended:

 public static void main(String[] args) throws IOException {
  FileInputStream inputStream = null;
  Scanner sc = null;
  try {
     PrintStream out = new PrintStream(new FileOutputStream("output.csv"));
     System.setOut(out);
  }
  catch(IOException e1) {
    System.out.println("Error during reading/writing");
  }
  try {
      inputStream = new FileInputStream("my-large-file.txt");
      sc = new Scanner(inputStream, "UTF-8");
      while (sc.hasNextLine()) {
        String line = sc.nextLine();

        // note: the specific indices of the substrings are random numbers and do not affect the program. They could be anything.
        if (!line.startsWith("the-number-that-I-don't-want")) {
           String filter2 = line.substring(55, 66);         
           if (!(filter2.equals("another-string-to-filter-out"))) {
              StringBuffer current     = new StringBuffer(line.substring(1, 10));
              StringBuffer another     = new StringBuffer(line.substring(10, 20).replaceAll("\\s+",""));
              StringBuffer third       = new StringBuffer(line.substring(22, 37).replaceAll("\\s+",""));
              StringBuffer fourth      = new StringBuffer(line.substring(37, 56));

              ...
              StringBuffer nth         = new StringBuffer(line.substring(999, 1000));

              System.out.println(current + ", " + another + ", " + third + ", " + fourth + ", " + ... + ", " + nth);
           }
        }
     }

     if (sc.ioException() != null) {
        throw sc.ioException();
     }
  } finally {
     if (inputStream != null) {
        inputStream.close();
     }
     if (sc != null) {
        sc.close();
     }

  }                                                                              
}

This link helped me out a lot: http://www.baeldung.com/java-read-lines-large-file

Alex Mac