
I'm writing a java program that does the following:

  1. Reads in a line from a file
  2. Does some action based on that line
  3. Delete the line (or replace it with ""), and if 2 is not successful, write it to a new file
  4. Continue on to the next line for all lines in file (as opposed to removing an arbitrary line)

Currently I have:

    try (BufferedReader br = new BufferedReader(new FileReader(inputFile))) {
        String line;
        while ((line = br.readLine()) != null) {
            try {
                if (!do_stuff(line)) { // do_stuff returns bool based on success
                    write_non_success(line);
                }
            } catch (Exception e) {
                e.printStackTrace(); // eat the exception for now, do something in the future
            }
        }
    }

Obviously I'm going to need something other than a BufferedReader for this, as it can't write, but what class should I use? Also, read order doesn't matter.

This differs from this question because I want to remove all lines, as opposed to an arbitrary line number as the other OP wants. If possible, I'd also like to avoid writing the temp file after every line, as my files are approximately 1 million lines long.

Mitch
  • You cannot read and write to a file at the same time. You will need to make a copy either in memory or the file system, and then write out to a different file. – Pete B. Apr 27 '16 at 14:28
  • 1
    You can't really delete the line. Deleting a line means physically shifting all the remaining text, and you really don't want to do that. – Andreas Apr 27 '16 at 14:29

1 Answer


If you do everything according to the algorithm that you describe, the content left in the original file would be the same as the content of "new file" from step #3:

  • If a line is processed successfully, it gets removed from the original file.
  • If a line is not processed successfully, it gets added to the new file, and it also stays in the original file.

It is easy to see why at the end of this process the original file is the same as the "new file". All you need to do is to carry out your algorithm to the end, and then copy the new file in place of the original.
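In code, that end-of-run replacement is a single move. A minimal sketch (the file names are made up for illustration, and the "failed" file stands in for the "new file" from step #3):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class ReplaceOriginal {
    public static void main(String[] args) throws IOException {
        // Hypothetical file names for illustration.
        Path original = Paths.get("input.txt");
        Path failed = Paths.get("failed.txt");

        Files.write(original, List.of("a", "b", "c"));
        Files.write(failed, List.of("b"));   // pretend only "b" failed processing

        // Once every line has been processed, the failed-lines file *is*
        // the desired final content of the original, so just move it over.
        Files.move(failed, original, StandardCopyOption.REPLACE_EXISTING);

        System.out.println(Files.readAllLines(original)); // [b]
    }
}
```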

If your concern is that the process is going to crash in the middle, the situation becomes very different: now you have to write out the current state of the original file after processing each line, without writing over the original until you are sure that it is going to be in a consistent state. You can do it by reading all lines into a list, deleting the first line from the list once it has been processed, writing the content of the entire list into a temporary file, and copying it in place of the original. Obviously, this is very expensive, so it shouldn't be attempted in a tight loop. However, this approach ensures that the original file is not left in an inconsistent state, which is important when you are looking to avoid doing the same work multiple times.
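A minimal sketch of that crash-safe per-line rewrite (the file names and the `doStuff` stand-in are invented for illustration; failed lines stay ahead of the unprocessed remainder, and the temp file replaces the original only after each line completes):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class CrashSafeProcessor {
    // Hypothetical per-line action: fails for lines starting with "bad".
    static boolean doStuff(String line) {
        return !line.startsWith("bad");
    }

    public static void main(String[] args) throws IOException {
        Path original = Paths.get("work.txt");
        Path temp = Paths.get("work.txt.tmp");
        Files.write(original, List.of("one", "bad two", "three"));

        List<String> remaining = new ArrayList<>(Files.readAllLines(original));
        List<String> failed = new ArrayList<>();

        while (!remaining.isEmpty()) {
            String line = remaining.get(0);
            if (!doStuff(line)) {
                failed.add(line);      // unsuccessful lines are kept
            }
            remaining.remove(0);

            // Write the current consistent state (failed + unprocessed) to a
            // temp file, then replace the original. Expensive per line, but
            // the original is never left half-written.
            List<String> state = new ArrayList<>(failed);
            state.addAll(remaining);
            Files.write(temp, state);
            Files.move(temp, original, StandardCopyOption.REPLACE_EXISTING);
        }

        System.out.println(Files.readAllLines(original)); // [bad two]
    }
}
```

With roughly a million lines, each iteration rewrites the whole remaining list, so this is O(n²) in total I/O; that is the cost of the consistency guarantee described above.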

Sergey Kalinichenko
  • If the concern is that the process is interrupted halfway through, and you cannot process the lines again, other means of tracking progress should be used. If not a concern, this is exactly the right answer. – Andreas Apr 27 '16 at 14:31
  • That's a very good point, and something I hadn't considered about the two files being the same in the end. I've edited the question to be more clear on that... @Andreas is correct about my concern though - the point is to safeguard against the process dying partway through – Mitch Apr 27 '16 at 14:35
  • @Musher Is your process expected to run to completion most of the time, but crash every now and then -- say, less than once in a hundred runs? – Sergey Kalinichenko Apr 27 '16 at 14:50
  • It's meant to be a long running process that polls a directory for files periodically. Ideally the only time it crashes is on a system reboot – Mitch Apr 27 '16 at 14:54
  • The best solution is to make your business logic [idempotent](http://stackoverflow.com/a/1077421/5221149). That way you can just reprocess a partially processed file. If you can't do that, and each line is processed and committed independently (if RDBMS), you need a safe(r) state tracking to skip previously processed lines, e.g. a tiny file with the last processed line number, written and flushed for every completed line. Performance will degrade with such an option, since flushing to disk is expensive. – Andreas Apr 27 '16 at 15:14
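The progress-file idea from the comment above could be sketched roughly like this (file names are hypothetical; `SYNC` forces each progress write to disk, which is the expensive flush the comment mentions):

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

public class ProgressTracker {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("input.txt");
        Path progress = Paths.get("input.txt.progress");
        Files.write(input, List.of("a", "b", "c"));

        // Resume from the last recorded line number, if a previous run died.
        int start = Files.exists(progress)
                ? Integer.parseInt(Files.readString(progress).trim())
                : 0;

        List<String> lines = Files.readAllLines(input);
        for (int i = start; i < lines.size(); i++) {
            // ... process lines.get(i) here ...

            // Record and flush progress after every completed line.
            Files.write(progress, String.valueOf(i + 1).getBytes(),
                        StandardOpenOption.CREATE,
                        StandardOpenOption.TRUNCATE_EXISTING,
                        StandardOpenOption.SYNC);
        }
        System.out.println(Files.readString(progress)); // 3
    }
}
```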