
This question has already been asked here and here, but none of the solutions worked for me.

How do I remove the first line from a large file efficiently in Python 3?

I am writing a program which requires logging, and the log file has a configurable maximum size, which could be infinite. Therefore, I do not want to use readlines() or similar methods as these would be memory intensive. Speed is not a huge concern, but if it can be done without rewriting the entire file, and without temporary files, that would be great.

Solutions need to be cross-platform.

Example log file:

[09:14:56 07/04/17] [INFO] foo
[23:45:01 07/04/17] [WARN] bar
[13:45:28 08/04/17] [INFO] foobar
... many thousands more lines

Output:

[23:45:01 07/04/17] [WARN] bar
[13:45:28 08/04/17] [INFO] foobar
... many thousands more lines

This code will be run in a loop:

while os.path.getsize(LOGFILE) > MAXLOGSIZE:
    # remove first line of file

None of the following solutions both works and is memory-efficient:

Solution #1 - works but inefficient

with open('file.txt', 'r') as fin:
    data = fin.read().splitlines(True)
with open('file.txt', 'w') as fout:
    fout.writelines(data[1:])

Solution #2 - doesn't work, leaves file empty

import shutil

source_file = open('file.txt', 'r')
source_file.readline()
target_file = open('file.txt', 'w')

shutil.copyfileobj(source_file, target_file)
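
(For reference: #2 fails because the second open('file.txt', 'w') truncates the file immediately, so by the time copyfileobj starts there is essentially nothing left to read. Below is a rough sketch of the same in-place idea using a single handle; the helper name and chunk size are illustrative, not from any of the linked solutions. It still rewrites the remaining bytes, which any in-place removal of a leading line requires, but it only ever holds one chunk in memory and needs no temporary file.)

def drop_first_line_inplace(path, chunk_size=64 * 1024):
    # Shift everything after the first line down to the start of the file,
    # then cut off the leftover tail. Only one chunk is held in memory.
    with open(path, 'rb+') as f:
        f.readline()          # consume the first line
        read_pos = f.tell()   # where the surviving data starts
        write_pos = 0         # where it should end up
        while True:
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            read_pos = f.tell()
            f.seek(write_pos)
            f.write(chunk)
            write_pos = f.tell()
        f.truncate(write_pos)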

Solution #3 - works, efficient, but uses an additional file:

with open("file.txt",'r') as f:
    with open("new_file.txt",'w') as f1:
        f.next() # skip header line
        for line in f:
            f1.write(line)
retnikt

2 Answers


So, this approach is very hacky. It will work well if your line sizes are about the same, with a small standard deviation. The idea is to read some portion of your file into a buffer that is small enough to be memory efficient but large enough that writing from both ends will not mess things up (since the lines are roughly the same size with little variance, we can cross our fingers and pray that it will work). We basically keep track of where we are in the file and jump back and forth. I use a collections.deque as a buffer because it has O(1) appends and pops at both ends, and we can take advantage of the FIFO nature of a queue:

from collections import deque
def efficient_dropfirst(f, dropfirst=1, buffersize=3):
    f.seek(0)
    buffer = deque()
    tail_pos = 0
    # these next two loops assume the file has many thousands of
    # lines so we can safely drop and buffer the first few...
    for _ in range(dropfirst):
        f.readline()
    for _ in range(buffersize):
        buffer.append(f.readline())
    line = f.readline()
    while line:
        buffer.append(line)
        head_pos = f.tell()   # remember how far the read side has got
        f.seek(tail_pos)      # jump back to the write side
        tail_pos += f.write(buffer.popleft())  # write one buffered line, advance the tail
        f.seek(head_pos)      # jump forward again and resume reading
        line = f.readline()
    f.seek(tail_pos)
    # finally, clear out the buffer:
    while buffer:
        f.write(buffer.popleft())
    f.truncate()

Now, let's try this out with a pretend file that behaves nicely:

>>> s = """1. the quick
... 2. brown fox
... 3. jumped over
... 4. the lazy
... 5. black dog.
... 6. Old McDonald's
... 7. Had a farm
... 8. Eeyi Eeeyi Oh
... 9. And on this farm they had a
... 10. duck
... 11. eeeieeeiOH
... """

And finally:

>>> import io
>>> with io.StringIO(s) as f: # we mock a file
...     efficient_dropfirst(f)
...     final = f.getvalue()
...
>>> print(final)
2. brown fox
3. jumped over
4. the lazy
5. black dog.
6. Old McDonald's
7. Had a farm
8. Eeyi Eeeyi Oh
9. And on this farm they had a
10. duck
11. eeeieeeiOH

This should work out OK as long as dropfirst < buffersize by a good bit of "slack". Since you only want to drop the first line, just keep dropfirst=1, and you can maybe make buffersize=100 or something just to be safe. It will be much more memory efficient than reading "many thousands of lines", and as long as no single line is bigger than the lines that came before it, the write position should never overtake the read position. But be warned, this is very rough around the edges.
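
For what it's worth, a minimal sketch of how this could be plugged into the loop from the question (the 'rb+' mode and the MAXLOGSIZE value are my assumptions, not from the answer; binary mode keeps tell()/seek()/write() working in plain byte offsets, which the bookkeeping above relies on and which Python 3 text-mode files do not guarantee):

import os

LOGFILE = 'file.txt'      # placeholder path from the question
MAXLOGSIZE = 1_000_000    # hypothetical size limit in bytes

while os.path.getsize(LOGFILE) > MAXLOGSIZE:
    # binary mode: offsets and write() return values are byte counts
    with open(LOGFILE, 'rb+') as f:
        efficient_dropfirst(f, dropfirst=1, buffersize=100)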

juanpa.arrivillaga
  • After extensive testing, this seems to work 100% of the time. From the code it seems like what you said should be right - it should behave unreliably. But unexpected reliability is fine by me! – retnikt May 01 '17 at 11:04
  • @retnikt if you enforced a line length (filling in where it doesn't reach the end, making a new line when it goes over) then you could get reliable behavior. That might be more trouble than it's worth – juanpa.arrivillaga May 01 '17 at 11:12 (see the sketch after these comments)
  • So, an example where it *won't* work: say there is a very long line, about len > 200, and the previous 100 lines each contain only a single character (i.e. a newline); then it will fail, and not prettily. – juanpa.arrivillaga May 01 '17 at 11:14
  • So that's what you meant. I misunderstood you. Sorry. – retnikt May 01 '17 at 11:15
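
Regarding the fixed line length idea from the comment above: a rough sketch of what enforcing it at write time could look like (the LINE_WIDTH value and the helper name are illustrative assumptions, not from the post). With every physical line the same length, the write position in efficient_dropfirst can never overtake the read position:

LINE_WIDTH = 120  # hypothetical fixed record width, including the trailing newline

def write_fixed_width(f, text):
    # Pad short messages and split long ones so that every physical line
    # written to the log is exactly LINE_WIDTH characters long.
    body = LINE_WIDTH - 1  # characters available before the newline
    chunks = [text[i:i + body] for i in range(0, len(text), body)] or [""]
    for chunk in chunks:
        f.write(chunk.ljust(body) + "\n")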

Try this. It uses the 3rd approach you mentioned, but won't create a new file.

filePath = r"E:\try.txt"
file_str = ""
with open(filePath,'r') as f:
        f.next()  # skip header line
        for line in f:
            file_str = file_str + line

with open(filePath, "w") as f:
    f.write(file_str)
Prakhar Verma
  • This is not a memory efficient solution. Also, I get this error: `AttributeError: '_io.TextIOWrapper' object has no attribute 'next'`. Is that because it's a 3rd party solution and requires some other module? – retnikt May 01 '17 at 08:28
  • @retnikt no, it's because in python 3 you need to use `next(f)` instead of `f.next()` – juanpa.arrivillaga May 01 '17 at 09:14
  • This isn't a solution for large files. Your script will fail because all of the memory will be used. – Aleksandar Makragić Nov 20 '18 at 10:09
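
As the comments point out, the call has to be next(f) in Python 3, and building file_str line by line keeps the whole remainder in memory. A minimal Python 3 rendering of the same approach (the path is the answer's placeholder; this fixes the AttributeError but not the memory problem):

filePath = r"E:\try.txt"

with open(filePath, 'r') as f:
    next(f)               # skip the first line (f.next() was Python 2 only)
    remaining = f.read()  # the rest of the file, all in memory at once

with open(filePath, "w") as f:
    f.write(remaining)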