I need to update the last line of files that are a bit more than 2 GB each, made up of lines of text that cannot be read with readlines(). Currently it works fine by looping through the file line by line. However, I am wondering if there is any compiled library that can achieve this more efficiently? Thanks!

Current approach

    with open("large.XML") as myfile:
        for line in myfile:
            do_something(line)
TTT
  • If it's XML why aren't you using an XML parser? You should be able to achieve something more efficient that way. I have used ElementTree and like it. – Warren P Nov 19 '15 at 18:32
  • @WarrenP Maybe the OP doesn't need to parse the XML? Also, wouldn't that read a big chunk of the file into memory which is what should be avoided? – Two-Bit Alchemist Nov 19 '15 at 18:33
  • related: http://stackoverflow.com/questions/7171140/using-python-iterparse-for-large-xml-files – Warren P Nov 19 '15 at 18:33
  • Perhaps an in-place edit of a file on disk is the fastest technique. – Warren P Nov 19 '15 at 18:35
  • Not really a duplicate because this person wants to rewrite the end of the file, not load it into ram. – Warren P Nov 19 '15 at 18:36
  • @WarrenP, thanks for your comments. I am using XML as an example, but I think this question applies to other types of plain text files as well. – TTT Nov 19 '15 at 19:19
  • I made some edits, I think you should try to state that you don't want to assume any particular format. But you do need to state why readlines won't work. Obviously for x in y is doing the same kind of read-line logic under the hood, just producing a lot more wasted ram. – Warren P Nov 19 '15 at 20:01
  • I don't understand how this isn't a duplicate but the accepted answer does the same thing as the accepted answer on the duplicate with the exceptions that it (1.) uses `mmap` to navigate to the spot in the file agnostically, rather than using the built-in `file` methods [is this necessary/faster? I don't know] and (2.) opens the file in `r+`, enabling writes, which is the obvious answer to this objection that the OP wants to write to the file rather than just reading it... – Two-Bit Alchemist Nov 20 '15 at 00:24
  • @Two-BitAlchemist: As you suggested, I closed this question. Thanks! – TTT Nov 20 '15 at 03:37
  • @Two-BitAlchemist: While [the other question](https://stackoverflow.com/q/136168/364696) is quite similar, adapting it to replace lines is non-trivial (it doesn't find the exact location where the lines in question end, so you have to adapt it to figure that out to perform the necessary truncation). I think it should be linked, but it's not a "duplicate" (it's asking a different question, and the answers are related, but not trivially reusable). – ShadowRanger Jul 01 '19 at 14:22

2 Answers

If this is really something line-based (where a true XML parser isn't necessarily the best solution), mmap can help here.

mmap the file, then call .rfind(b'\n') on the resulting object (possibly with adjustments to handle the file ending with a newline when you really want the non-empty line before it, not the empty "line" following it). You can then slice out the final line alone. If you need to modify the file in place, you can resize the file to shave off (or add) the number of bytes corresponding to the difference between the line you sliced and the new line, then write back the new line. This avoids reading or writing any more of the file than you need.

Example code (please comment if I made a mistake):

import mmap

# In Python 3.1 and earlier, you'd wrap mmap in contextlib.closing; mmap
# didn't support the context manager protocol natively until 3.2; see example below
with open("large.XML", 'r+b') as myfile, mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
    # len(mm) - 1 handles files ending w/newline by getting the prior line
    # + 1 to avoid catching prior newline (and handle one line file seamlessly)
    startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1

    # Get the line (with any newline stripped)
    line = mm[startofline:].rstrip(b'\r\n')

    # Do whatever calculates the new line, decoding/encoding to use str
    # in do_something to simplify; this is an XML file, so I'm assuming UTF-8
    new_line = do_something(line.decode('utf-8')).encode('utf-8')

    # Resize to accommodate the new line (or to strip data beyond the new line)
    mm.resize(startofline + len(new_line))  # + 1 if you need to add a trailing newline
    mm[startofline:] = new_line  # Replace contents; add a b"\n" if needed
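
The do_something call above is the OP's placeholder for whatever computes the replacement text; it isn't defined anywhere in the thread. A hypothetical stand-in, purely so the snippet can be run end to end, might look like this:

    def do_something(old_line):
        # Hypothetical placeholder: receives the decoded last line as str and
        # returns the new last line as str. Substitute your real logic here.
        return old_line.upper()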

Apparently on some systems (e.g. OSX) without mremap, mm.resize won't work, so to support those systems, you'd probably split the with (so the mmap closes before the file object), and use file object based seeks, writes and truncates to fix up the file. The following example includes my previously mentioned Python 3.1 and earlier specific adjustment to use contextlib.closing for completeness:

import mmap
from contextlib import closing

with open("large.XML", 'r+b') as myfile:
    with closing(mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE)) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline)  # Move to where old line began
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess

The advantages to mmap over any other approach are:

  1. No need to read any more of the file beyond the line itself (meaning 1-2 pages of the file, the rest never gets read or written)
  2. Using rfind means you can let Python do the work of finding the newline quickly at the C layer (in CPython); explicit seeks and reads of a file object could match the "only read a page or so", but you'd have to hand-implement the search for the newline

Caveat: This approach will not work (at least, not without modification to avoid mapping more than 2 GB, and to handle resizing when the whole file might not be mapped) if you're on a 32 bit system and the file is too large to map into memory. On most 32 bit systems, even in a newly spawned process, you only have 1-2 GB of contiguous address space available; in certain special cases, you might have as much as 3-3.5 GB of user virtual addresses (though you'll lose some of the contiguous space to the heap, stack, executable mapping, etc.). mmap doesn't require much physical RAM, but it needs contiguous address space; one of the huge benefits of a 64 bit OS is that you stop worrying about virtual address space in all but the most ridiculous cases, so mmap can solve problems in the general case that it couldn't handle without added complexity on a 32 bit OS. Most modern computers are 64 bit at this point, but it's definitely something to keep in mind if you're targeting 32 bit systems (and on Windows, even if the OS is 64 bit, they may have installed a 32 bit version of Python by mistake, so the same problems apply).

Here's yet one more example that works (assuming the last line isn't 100+ MB long) on 32 bit Python (omitting closing and imports for brevity) even for huge files:

with open("large.XML", 'r+b') as myfile:
    filesize = myfile.seek(0, 2)
    # Get an offset that only grabs the last 100 MB or so of the file aligned properly
    offset = max(0, filesize - 100 * 1024 ** 2) & ~(mmap.ALLOCATIONGRANULARITY - 1)
    with mmap.mmap(myfile.fileno(), 0, access=mmap.ACCESS_WRITE, offset=offset) as mm:
        startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1
        # If line might be > 100 MB long, probably want to check if startofline
        # follows a newline here
        line = mm[startofline:].rstrip(b'\r\n')
        new_line = do_something(line.decode('utf-8')).encode('utf-8')

    myfile.seek(startofline + offset)  # Move to where old line began, adjusted for offset
    myfile.write(new_line)  # Overwrite existing line with new line
    myfile.truncate()  # If existing line longer than new line, get rid of the excess
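
The in-code comment about checking whether startofline actually follows a newline matters if the final line could be longer than the 100 MB window. A hypothetical guard (a sketch, not part of the original answer) could go right after the rfind call: if no newline was found inside the mapped window and the window doesn't start at the beginning of the file, the last line extends past the window, so it's safer to bail out than to overwrite the wrong bytes.

    # Hypothetical guard: rfind() returned -1 (so startofline is 0) and the
    # window is not anchored at the start of the file, meaning the real start
    # of the last line lies somewhere before the mapped region.
    if startofline == 0 and offset > 0:
        raise ValueError("final line is longer than the mapped window; enlarge it")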
ShadowRanger
  • The OP seems to want to change the last line not add a new one – Padraic Cunningham Nov 19 '15 at 19:01
  • Just use `mm.rfind(b'\n', 0, len(mm) - 1)`. If the last byte is a newline, that will skip it. If it's anything else, including a one-character line or zero-character line, the code will still work. – Harvey Nov 19 '15 at 19:03
  • @PadraicCunningham: Yeah, my code does that. It's replacing whatever the last line was with a brand new line. – ShadowRanger Nov 19 '15 at 19:04
  • @Harvey: Good point, though in this case, I'm also using `endofline` to limit the slice (to omit the newline). Arguable whether that's "correct" behavior I suppose. If you want the newline if it exists, then yeah, it could simplify. I'll probably change the code to use an `rstrip` to make that explicit though. – ShadowRanger Nov 19 '15 at 19:04
  • I mean they want to update the line in the original file – Padraic Cunningham Nov 19 '15 at 19:07
  • @PadraicCunningham: Yeah. That's what this does. It opens the file, mutates it in place, then closes it. It slices out the final line of the file, processes it to create some replacement line, then replaces the existing final line with the new line. All in the same file. The original file no longer has the original final line, instead having the replacement line beginning at the place the old line was. Am I missing something? – ShadowRanger Nov 19 '15 at 19:12
  • Bummer, on OSX: "SystemError: mmap: resizing not available--no mremap()". It looks like the solution is to close the file, reopen, seek to `startofline` and then write. – Harvey Nov 19 '15 at 19:16
  • Should be `startofline = mm.rfind(b'\n', 0, len(mm) - 1) + 1` (note the +1) to preserve the previous newline. It also has the accidental effect of removing the need for testing for not found. – Harvey Nov 19 '15 at 19:21
  • @Harvey: Thanks! I've provided an alternative bit of code for systems without `mmap.resize` support, and fixed up the `startofline` calculation. Damn you, off-by-one errors! – ShadowRanger Nov 19 '15 at 19:25
  • Thanks so much for the code example and detailed explanation. Like you mentioned, I ran into the `ValueError: mmap length is too large` issue. – TTT Nov 19 '15 at 20:09
  • @tao.hong: If you're on a 32 bit system, the code could be adjusted. Basically, you'd get the file size, then `mmap` starting at, say, 100 MB from end of file (which should be more than long enough to contain the final line), calculating `offset=filesize - (100 * 1024 ** 2)` and passing `offset=offset` to `mmap`. You'd then use the code that doesn't use `mmap.resize` (because you haven't mapped the whole file), and adjust the `seek` call to use `startofline + offset` so it seeks to the correct place in the larger file. – ShadowRanger Nov 19 '15 at 20:15
  • Thanks so much for the suggestion! – TTT Nov 19 '15 at 20:24
  • @tao.hong: I've added example code to the end of my answer. It needed one tweak from the version suggested in my comments, to ensure that the `offset` was properly aligned (you can't `mmap` at uneven offsets) and to ensure it didn't go below 0. – ShadowRanger Nov 19 '15 at 20:25
  • @ShadowRanger:Thanks so much and I really appreciate the follow up! – TTT Nov 19 '15 at 23:47
  • I honestly don't know if this is better, worse, or the same as the accepted answer on the duplicate, but I think you should consider adapting this as an answer to that question (in addition to this). That is, assuming it's not already there. That question has 19 answers and I have not read through them all. – Two-Bit Alchemist Nov 20 '15 at 03:50
  • @Two-BitAlchemist: Agreed. This answer is much better than the ones for the duplicate. ShadowRanger should post his answer there, too. Adding a `numlines` parameter (equal to 1 for this problem) should make it work for the duplicate: `startofline = len(mm) - 1; for _ in range(numlines): startofline = mm.rfind(b'\n', 0, startofline); if startofline < 0: break; startofline += 1` (expanded into a readable sketch below) – Harvey Nov 20 '15 at 14:38
  • @Two-BitAlchemist: I [made an answer there](http://stackoverflow.com/a/34029605/364696), feel free to critique. – ShadowRanger Dec 01 '15 at 20:39
  • @Harvey: Definitely critique: you're good at catching my stupid. :-) – ShadowRanger Dec 01 '15 at 20:40
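
Expanding Harvey's numlines suggestion from the comment above into a readable sketch (a hypothetical generalization, not part of the original answer): replace the single rfind() call with a small loop that walks back one newline per requested line.

    # Hypothetical generalization: find where the last `numlines` lines begin.
    # Drop-in replacement for the startofline calculation in the answer above.
    numlines = 1  # 1 reproduces the original single-line behaviour
    startofline = len(mm) - 1
    for _ in range(numlines):
        startofline = mm.rfind(b'\n', 0, startofline)
        if startofline < 0:
            break
    startofline += 1  # move just past the newline (or to 0 if none was found)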

Update: Use ShadowRanger's answer. It's much shorter and more robust.

For posterity:

Read the last N bytes of the file and search backwards for the newline.

#!/usr/bin/env python

with open("test.txt", "wb") as testfile:
    testfile.write(b'\n'.join([b"one", b"two", b"three"]) + b'\n')

with open("test.txt", "r+b") as myfile:
    # Read the last 1 KiB of the file.
    # We could make this dynamic, but chances are there's
    # a number like 1 KiB that'll work 100% of the time for you.
    myfile.seek(0, 2)
    filesize = myfile.tell()
    blocksize = min(1024, filesize)
    myfile.seek(-blocksize, 2)
    # Search backwards for a newline (excluding the very last byte
    # in case the file ends with a newline).
    index = myfile.read().rindex(b'\n', 0, blocksize - 1)
    # Seek to the character just after the newline.
    myfile.seek(index + 1 - blocksize, 2)
    # Read in the last line of the file.
    lastline = myfile.read()
    # Modify lastline.
    lastline = b"Brand New Line!\n"
    # Seek back to the start of the last line.
    myfile.seek(index + 1 - blocksize, 2)
    # Write out the new version of the last line.
    myfile.write(lastline)
    myfile.truncate()
Harvey
  • Probably want to use `rfind`, not `rindex`, or you'll handle single line files by throwing an exception, when you could just rewrite the single line. Suppose it depends on whether multiple lines are known to exist. – ShadowRanger Nov 19 '15 at 19:29
  • @ShadowRanger: I started to do that, but you don't know if you really found the start of line or just the beginning of block. I'm recommending your answer while leaving mine around for people to see. – Harvey Nov 19 '15 at 19:31
  • Ah, right. Forgot about the block read. Big advantage to `mmap` is that you don't need to worry about that sort of thing. :-) – ShadowRanger Nov 19 '15 at 19:32
  • This has the catch that you need to establish an *n* which is guaranteed to be large enough to always include the final newline, or arrange a fallback to some other approach (repeating with bigger and bigger blocks of *n* is probably not a great fallback strategy). – tripleee Nov 19 '15 at 19:36
  • @tripleee: agreed. That's why I'm recommending the mmap method. I always forget about mmap. – Harvey Nov 19 '15 at 19:37