
I'm running Ubuntu 16.04 LTS with Python 3.6.8, and I have the following code that lets me iterate over the lines in a file, processing each row and appending its data to a database. I need to process a row and then delete it, replace it with a \n, or do anything else to reduce the file size of the text file. Also, I need at most two copies of the data at any time: the database and the file with the already-processed lines deleted.

with open(filename, buffering=1000) as f:
    for row in f:
        # process text
        # delete row or replace it with '\n'
        ...

How exactly do I do this?

Soumitra Shewale

2 Answers


You have a big problem here: deleting from the middle of a file isn't something most operating systems and filesystems support, and where it is possible, it's an esoteric operation with complicated constraints.

So the normal way to delete from the middle of a file is to rewrite the entire file. But you indicate in the comments that your file is hundreds of gigabytes. Reading the whole file, processing one line, and rewriting the whole file is expensive and requires extra temporary storage; if you do this for every line, you'll end up doing far more work and still need roughly double the disk space anyway.

If you absolutely have to do this, here are some possibilities:

  • Read the file backwards and truncate it as you go. Reading backwards is awkward because little tooling is set up to help with it, but in principle it's possible, and truncating the end of a file doesn't require copying it (see the sketch after this list).
  • Use smaller files, and delete each file after you've processed it. This depends on being able to change how the files are created, but if you can, it's much simpler and lets you delete processed pieces sooner.
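
If you do go the backwards route, here is a minimal sketch of the idea, assuming a plain text file of newline-separated records; `filename` and `process_line` are placeholders for your own path and per-row logic, not anything from the question:

import os

CHUNK = 1 << 20  # read roughly 1 MiB at a time

def process_line(line):
    ...  # placeholder: e.g. insert the parsed row into your database

with open(filename, "r+b") as f:
    f.seek(0, os.SEEK_END)
    end = f.tell()
    tail = b""  # partial line carried between chunks
    while end > 0:
        start = max(0, end - CHUNK)
        f.seek(start)
        block = f.read(end - start) + tail
        lines = block.split(b"\n")
        # the first piece may be an incomplete line that continues into the
        # not-yet-read part of the file; hold it in memory for the next pass
        tail = lines.pop(0) if start > 0 else b""
        for raw in reversed(lines):
            if raw:
                process_line(raw.decode("utf-8"))
        # shrink the file so the processed region stops taking up space
        f.truncate(start)
        end = start

Note that the partial line held in `tail` exists only in memory once the truncate has happened, so a crash mid-run can lose at most one record.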

On the other hand, do you definitely need to? Is the problem that the file is so big that the database will run out of room if it's still on the disk? Or do you just want to process more huge files simultaneously? If the latter, have you checked that processing multiple files simultaneously actually goes faster than doing the same files one after the other? And of course, could you buy more disks or a bigger disk?

Weeble

You can rewrite portions of the file in place; you just can't do arbitrary insertion or removal, since the length can't change. If the final consumer of the file ignores # comment lines or whitespace, then you're golden. In database parlance, where each record carries a type attribute, we would describe this as setting the record type to "tombstone".

As you read each line or chunk, use tell() to find the file position where it begins. Decide whether to delete it. If so, use seek() to back up to that position and write() whitespace (such as blanks followed by a \n newline) over the offending record. Then continue reading.
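
Here is a minimal sketch of that tell()/seek()/overwrite loop, assuming newline-terminated text records; `filename`, `process`, and `should_delete` are placeholder names for your own path and logic:

def process(line):
    ...  # placeholder: e.g. append the parsed row to your database

def should_delete(line):
    return True  # in the question, every processed row gets blanked out

with open(filename, "r+b") as f:
    while True:
        pos = f.tell()        # start of the line we're about to read
        raw = f.readline()
        if not raw:
            break             # end of file
        line = raw.decode("utf-8")
        process(line)
        if should_delete(line):
            newline = b"\n" if raw.endswith(b"\n") else b""
            f.seek(pos)
            # overwrite the record with spaces, keeping the newline so the
            # file length (and every later offset) stays the same
            f.write(b" " * (len(raw) - len(newline)) + newline)
            f.seek(pos + len(raw))  # continue where readline() left off

The file size never changes here; rows are merely whited out, which is the "tombstone" idea described above.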

J_H
  • Would this work if the file is too large to be read into RAM? I’m using a buffer size of 1000. – Soumitra Shewale Mar 23 '19 at 15:55
  • That's why you want to read line-by-line, or by chunk, so you only have a portion of the file in RAM at any instant. For line-by-line you would need a 2nd file descriptor open for update, so as not to disturb the line iterator. For binary chunks, your app will explicitly `seek()` to the proper place before each read. – J_H Mar 23 '19 at 16:14
  • This doesn't reduce the file size, which I understood to be a requirement of the question: "I need to process a row, and then delete it or replace it with an \n or do anything _to reduce the file-size of the text file_." I wonder if @SoumitraShewale could clarify this point? – Weeble Mar 24 '19 at 01:59
  • Reducing the file size as the OP extracts data from the original file seemed to be the OP's priority, based on his post and comments. – Life is complex Mar 24 '19 at 03:21
  • Yes, @Weeble and Lifeiscomplex are correct. This seemed like the most logical solution, so I marked it as the correct one. I apologize. Anyway, I don't need this anymore; I just decided to get a bigger SSD. – Soumitra Shewale Mar 24 '19 at 06:26
  • @SoumitraShewale no worries. Your question has prompted me to look at chunking more, especially surrounding JSON files. I almost have a solution that works for me, because I also need to process large JSON files. Happy Coding!! – Life is complex Mar 24 '19 at 13:56
  • In the (somewhat vague) question and comments, the OP expressed a reluctance to consume additional disk space in the course of deleting some arbitrary subset of records. I described a way to accomplish that with zero additional bytes of storage. As a post-processing step one could choose to filter to a 2nd file, or simply use a utility like `gzip`, with compression ratios improved by the "boring" repeated tombstone data. That would temporarily consume extra storage, which it seemed the OP could not afford. – J_H Mar 24 '19 at 14:55