
I'm running Ubuntu 16.04 LTS with Python 3.6.8, and I have the following code that lets me iterate over the lines in a file, processing each row and appending its data to a database. I need to process a row and then delete it, replace it with a \n, or do anything else to reduce the file size of the text file. Also, I need at most two copies of the data at any time: the database and the file with the already-processed lines deleted.

with open(filename, buffering=1000) as f:
    for row in f:
        # process text
        # delete row or replace it with '\n'
        ...

How exactly do I do this?

Soumitra Shewale

2 Answers


You have a big problem here: deleting from the middle of a file isn't something most operating systems and filesystems support, and where it is possible, it's an esoteric operation with complicated constraints.

So the normal way to delete from the middle of a file is to rewrite the entire file. But you indicate in the comments that your file is hundreds of gigabytes. Reading the whole file, processing one line, and rewriting the whole file is expensive and requires extra temporary storage; if you do this for every line, you'll end up doing far more work and still need roughly double the disk space anyway.

If you absolutely have to do this, here are some possibilities:

  • Read the file backwards and truncate it as you go. Reading backwards is awkward because little tooling is set up to help with it, but in principle it's possible, and truncating the end of a file doesn't require copying it (see the sketch after this list).
  • Use smaller files, and delete each file after you've processed it. This depends on being able to change how the files are created, but if you can, it's much simpler and lets you delete processed pieces sooner.
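
If you do go the backwards route, here is a minimal sketch of the idea, assuming a plain text file of newline-separated records; `filename` and `process_line` are placeholders for your own path and per-row logic, not anything from the question:

import os

CHUNK = 1 << 20  # read roughly 1 MiB at a time

def process_line(line):
    ...  # placeholder: e.g. insert the parsed row into your database

with open(filename, "r+b") as f:
    f.seek(0, os.SEEK_END)
    end = f.tell()
    tail = b""  # partial line carried between chunks
    while end > 0:
        start = max(0, end - CHUNK)
        f.seek(start)
        block = f.read(end - start) + tail
        lines = block.split(b"\n")
        # the first piece may be an incomplete line that continues into the
        # not-yet-read part of the file; hold it in memory for the next pass
        tail = lines.pop(0) if start > 0 else b""
        for raw in reversed(lines):
            if raw:
                process_line(raw.decode("utf-8"))
        # shrink the file so the processed region stops taking up space
        f.truncate(start)
        end = start

Note that the partial line held in `tail` exists only in memory once the truncate has happened, so a crash mid-run can lose at most one record.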

On the other hand, do you definitely need to? Is the problem that the file is so big that the database will run out of room if it's still on the disk? Or do you just want to process more huge files simultaneously? If the latter, have you checked that processing multiple files simultaneously actually goes faster than doing the same files one after the other? And of course, could you buy more disks or a bigger disk?

Weeble

You can rewrite portions of the file in place; you just can't do arbitrary insertion or removal, since the length can't change. If the final consumer of the file ignores # comment lines or whitespace, then you're golden. In database parlance, where each record carries a type attribute, we would describe this as setting the record type to "tombstone".

As you read each line or chunk, use tell() to find the file position where it begins. Decide whether to delete it. If so, use seek() to back up to that position and write() whitespace (such as blanks followed by a \n newline) over the offending record. Then continue reading.
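
Here is a minimal sketch of that tell()/seek()/overwrite loop, assuming newline-terminated text records; `filename`, `process`, and `should_delete` are placeholder names for your own path and logic:

def process(line):
    ...  # placeholder: e.g. append the parsed row to your database

def should_delete(line):
    return True  # in the question, every processed row gets blanked out

with open(filename, "r+b") as f:
    while True:
        pos = f.tell()        # start of the line we're about to read
        raw = f.readline()
        if not raw:
            break             # end of file
        line = raw.decode("utf-8")
        process(line)
        if should_delete(line):
            newline = b"\n" if raw.endswith(b"\n") else b""
            f.seek(pos)
            # overwrite the record with spaces, keeping the newline so the
            # file length (and every later offset) stays the same
            f.write(b" " * (len(raw) - len(newline)) + newline)
            f.seek(pos + len(raw))  # continue where readline() left off

The file size never changes here; rows are merely whited out, which is the "tombstone" idea described above.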

J_H
  • Would this work if the file is too large to be read into RAM? I’m using a buffer size of 1000. – Soumitra Shewale Mar 23 '19 at 15:55
  • That's why you want to read line-by-line, or by chunk, so you only have a portion of the file in RAM at any instant. For line-by-line you would need a 2nd file descriptor open for update, so as not to disturb the line iterator. For binary chunks, your app will explicitly `seek()` to the proper place before each read. – J_H Mar 23 '19 at 16:14
  • This doesn't reduce the file size, which I understood to be a requirement of the question: "I need to process a row, and then delete it or replace it with an \n or do anything _to reduce the file-size of the text file_." I wonder if @SoumitraShewale could clarify this point? – Weeble Mar 24 '19 at 01:59
  • Reducing the file size as the OP extracts data from the original file seemed to be the OP's priority, based on his post and comments. – Life is complex Mar 24 '19 at 03:21
  • Yes, @Weeble and Lifeiscomplex are correct. This seemed like the most logical solution, so I marked it as the correct one. I apologize. Anyway, I don't need this anymore; I just decided to get a bigger SSD. – Soumitra Shewale Mar 24 '19 at 06:26
  • @SoumitraShewale no worries. Your question has prompted me to look at chunking more, especially surrounding JSON files. I almost have a solution that works for me, because I also need to process large JSON files. Happy Coding!! – Life is complex Mar 24 '19 at 13:56
  • In the (somewhat vague) question and comments, the OP expressed a reluctance to consume additional disk space in the course of deleting some arbitrary subset of records. I described a way to accomplish that with zero additional bytes of storage. As a post-processing step one could choose to filter to a 2nd file, or simply use a utility like `gzip`, with compression ratios improved by the "boring" repeated tombstone data. That would temporarily consume extra storage, which it seemed the OP could not afford. – J_H Mar 24 '19 at 14:55