
I have to read a sequential file which has over a million records. I have to read each line/record, delete that record/line from the file, and keep on reading.

I am not finding any example of how to do this without using a temporary file or creating/recreating a new file of the same name.

These are text files. Each file is about 0.5 GB, and we have over a million lines/records in each file.

Currently we are copying all the records to memory, as we do not want to re-process any record if anything happens in the middle of processing a file.

VictorGram
  • What requirement/circumstance prevents you from reading the complete file and then deleting the file? How big is the file? – reto Jun 11 '14 at 20:05
  • Question not very clear: if you have to delete _each_ line, why not just delete the entire file after you've read it? – Victor Sorokin Jun 11 '14 at 20:05
  • What type of file? If you're talking about deleting lines from the middle of a text file in-place, you have to shift the entire contents after the removed line to fill in the gap. It will be much more efficient to copy to a new file and overwrite the old file if there's space. – Alex Jun 11 '14 at 20:07
  • possible duplicate of [Java - Find a line in a file and remove](http://stackoverflow.com/questions/1377279/java-find-a-line-in-a-file-and-remove) – DwB Jun 11 '14 at 20:10
  • If all you are doing is removing some of the lines that meet certain criteria, there are many command-line tools that will do that without having to write a Java program. A (partial) list includes `[e|f]grep`, `sed`, `awk`, and `perl` – Stephen P Jun 11 '14 at 20:20
  • What you are asking for is actually impossible. You could _overwrite_ a line perhaps, but not remove a line, since removing implies that the line after is now immediately after the line preceding the deleted one, and that involves rewriting the rest of the file. Think of files as arrays and not as linked lists (or more accurately as linked lists of arrays, where each element is an inode of ~1kb depending on the file system). – Sled Jun 11 '14 at 20:21
  • Your question is still unclear. Two statements _"I have to read each line/record and have to delete that record"_ **and** _"We need to read these lines one by one and process it."_ both imply that every line will be deleted when you're done processing each -all of- the records. If you're processing and deleting _some_ records, or processing all but deleting only _some_, be clear about that (and what is "processing"? just deciding what to delete?) -- Otherwise, if you're deleting everything, don't bother and just delete the whole file after all processing is done. Please clarify. – Stephen P Jun 11 '14 at 20:33
  • Unless you are working on some kind of highly restricted device, 0.5GB is nothing. A modern phone will let you hold that much data in RAM. Process each file in RAM and rewrite. – DJClayworth Jun 11 '14 at 20:35
  • I think **now I get it** -- you want to process a record and know you've processed it in case the application or system crashes, power goes out, etc. so you don't process it again when you come back up. Deleting the record when you're done means you can just start reading at the beginning of the file when you re-run. Reading everything into memory means you've lost where you are if the app crashes part-way through the data. Deleting from the start of the file will *always* mean re-writing the remainder of the file after *every* record so it is kept up-to-date on disk. – Stephen P Jun 11 '14 at 21:01
  • You got it right Stephen. – VictorGram Jun 11 '14 at 21:29
  • @Patty You should rewrite the question to make that clear. There are much more efficient ways of doing what Stephen described than the answer you have accepted. – DJClayworth Jun 12 '14 at 15:58

4 Answers


Assuming that the file in question is a simple sequential file, you can't. In the Java file model, deleting part of a file implies deleting all of it after the deletion point.
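To make that concrete: the only in-place "delete" a plain Java file supports is truncation at an offset. A minimal sketch (the file name here is a placeholder, not anything from the question):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class TruncateTail {
    public static void main(String[] args) throws IOException {
        // "records.txt" is a placeholder name. setLength() is the only
        // in-place "delete" available: it cuts the file off at the given
        // offset and discards everything after it. There is no call that
        // removes a middle region and closes up the gap.
        try (RandomAccessFile f = new RandomAccessFile("records.txt", "rw")) {
            f.setLength(1024); // keep the first 1 KB, drop the rest
        }
    }
}
```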

Some alternative approaches are:

  • In your process, copy the file, omitting the parts you want deleted. This is the normal way of doing this.
  • Overwrite the parts of the file you want deleted with some value that you know never occurs in the file, and then at a later date copy the file, removing the marked parts (see the sketch after this list).
  • Store the entire file in memory, edit it as required, and write it again. Just because you have a million records doesn't make that impossible. If your files are 0.5GB, as you say, then this approach is almost certainly viable.
  • Each time you delete some record, copy all of the contents of the file after the deletion to its new position. This will be incredibly inefficient and error-prone.
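For the second option, a rough sketch of marking a record as deleted in place might look like the following. The filler byte `'#'` and the idea that you already know each record's offset and length are assumptions for illustration:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

public class MarkDeleted {
    // Overwrite a region of the file with a filler byte instead of
    // physically removing it; a later compaction pass copies the file
    // and skips any record that consists entirely of the filler.
    static void markDeleted(RandomAccessFile file, long offset, int length)
            throws IOException {
        byte[] filler = new byte[length];
        Arrays.fill(filler, (byte) '#'); // '#' assumed never to occur in real data
        file.seek(offset);
        file.write(filler);
    }
}
```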

Unless you can store the file in memory, using a temporary file is the most efficient approach. That's why everyone does it.

If this is some kind of database, then that's an entirely different question.

EDIT: Since I answered this, comments have indicated that what the user wants to do is use deletion to keep track of which records have already been processed. If that is the case, there are much simpler ways of doing this. One good way is to write a side file which just contains a count of how many bytes (or records) of the input file have been processed. If the processor crashes, use that count to remove the already-processed records (or simply skip past them) and start again.
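Here is a minimal sketch of that idea, using the skip-past variant so the data file is never rewritten. It assumes Java 11+ (for `Files.readString`/`Files.writeString`); the file names and the `process` method are placeholders, not anything from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class CheckpointedReader {
    public static void main(String[] args) throws IOException {
        Path data = Paths.get("records.txt");       // placeholder input file
        Path checkpoint = Paths.get("records.pos"); // side file: count of processed records

        // Resume point: how many records were finished before any crash.
        long done = Files.exists(checkpoint)
                ? Long.parseLong(Files.readString(checkpoint).trim())
                : 0;

        try (BufferedReader in = Files.newBufferedReader(data, StandardCharsets.UTF_8)) {
            long lineNo = 0;
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (lineNo <= done) {
                    continue; // already processed on a previous run
                }
                process(line);
                // Record progress after each record; checkpoint every N records
                // instead if one small write per record is too slow.
                Files.writeString(checkpoint, Long.toString(lineNo));
            }
        }
        Files.deleteIfExists(checkpoint); // finished cleanly
    }

    static void process(String record) {
        // application-specific work
    }
}
```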

DJClayworth
  • FWIW, it's possible to do the 4th option in a single pass by maintaining separate read vs. write offsets so it's still fairly efficient, but still error-prone (and can leave the file in an inconsistent state if the program crashes) so I'd still go the copy route unless there's a really good reason not to. – Alex Jun 11 '14 at 20:18
  • You could do that... or you could keep a running total of the number of bytes to move each subsequent record after the first deleted record, and increment it with each new record you want to delete, so each record only gets moved once. – Alex Jun 11 '14 at 20:30

Files are unstructured streams of bytes; there is no record structure. You cannot "delete" a "line" from an unstructured stream of bytes.

The basic algorithm you need to use is this (a Java sketch follows the steps):

  1. Create a temporary file.
  2. Open the input file.
  3. If at the end of the input file, go to step 7.
  4. Read a line from the input file.
  5. If the line is not to be deleted, write it to the temporary file.
  6. Go to step 3.
  7. Close the input file.
  8. Close the temporary file.
  9. Delete (or just rename) the input file.
  10. Rename (or move) the temporary file to the original name of the input file.
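A rough Java rendering of these steps might look like this. The input file name and the `shouldDelete` criterion are placeholders; it is a sketch of the algorithm above, not a drop-in solution:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class DeleteLines {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("records.txt");               // placeholder name
        Path temp = Files.createTempFile("records", ".tmp"); // step 1

        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8); // step 2
             BufferedWriter out = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) { // steps 3-4
                if (!shouldDelete(line)) {           // step 5
                    out.write(line);
                    out.newLine();
                }
            }                                        // step 6: loop back
        }                                            // steps 7-8: close both files

        // Steps 9-10: replace the original with the filtered copy.
        Files.move(temp, input, StandardCopyOption.REPLACE_EXISTING);
    }

    static boolean shouldDelete(String line) {
        return line.trim().isEmpty(); // placeholder criterion
    }
}
```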
DwB

There is a similar question: [Java - Find a line in a file and remove](http://stackoverflow.com/questions/1377279/java-find-a-line-in-a-file-and-remove).

Basically, the answers there all use a temp file, and there is no harm in doing so. So why not just do it? It will not affect your performance much and avoids a class of errors.

Anton
  • We have to process millions of records, and that's why I am looking for options that will give better performance. – VictorGram Jun 11 '14 at 20:09
  • @Patty I don't know that you can just modify an existing file. Most libraries just copy and modify. Some file systems enforce that (i.e. ones that deal with historical data). If you want to modify a file in-situ, I doubt you'll be able to do it without some filesystem-specific APIs, and definitely not in Java. – Sled Jun 11 '14 at 20:12

Why not a simple `sed -si '/line I want to delete/d' big_file`?

Iazel
  • That just hides the temp file behind the scenes doesn't it? The OP is asking for performance reasons to do the change in place. – Sled Jun 11 '14 at 20:22
  • Doing it "in place" will be horribly expensive (multiple shifting lines around). – vonbrand Jun 11 '14 at 20:35
  • @vonbrand Yeah, it's a terrible idea but it is what the asker is asking for. It seems they think of files as linked-lists of characters. – Sled Jun 11 '14 at 20:49
  • @ArtB, perhaps they should use something else (SQLite, some indexed file library, ...) instead? – vonbrand Jun 12 '14 at 01:35