26

I am working with a very large (~11GB) text file on a Linux system. I am running it through a program which is checking the file for errors. Once an error is found, I need to either fix the line or remove the line entirely. And then repeat...

Eventually once I'm comfortable with the process, I'll automate it entirely. For now however, let's assume I'm running this by hand.

What would be the fastest (in terms of execution time) way to remove a specific line from this large file? I thought of doing it in Python...but would be open to other examples. The line might be anywhere in the file.

If Python, assume the following interface:

def removeLine(filename, lineno):

Thanks,

-aj

AJ.
  • 3
    Using grep -v would likely be quicker than using Python – dangerstat Feb 24 '10 at 20:55
  • Which line do you have to remove? How will you be able to identify it? The answer to this could make a big difference to the strategy. – Mark Byers Feb 24 '10 at 20:56
  • 1
    Is a scripting solution absolutely necessary? Large Text File Viewer (http://www.swiftgear.com/ltfviewer/features.html) should be able to handle the file and you can search for the correct line using Regular Expressions. – Dawson Goodell Feb 24 '10 at 20:59
  • A proper text editor (e.g. gvim) shouldn't have much trouble with a large text file. 11GB isn't uncommon... – tux21b Feb 24 '10 at 21:01
  • Revised the question to give more details on the requirement, thanks. – AJ. Feb 24 '10 at 21:03
  • @dangerstat - what solution would you propose in grep? – AJ. Feb 24 '10 at 21:04
  • @Mark Byers - I would be getting the line number based on the output of another program. It could occur anywhere in the file. – AJ. Feb 24 '10 at 21:08
  • @OSMman - using Linux, revised my question. Thanks. – AJ. Feb 24 '10 at 21:08
  • @AJ cat the file and pipe it into grep -v with the string / line you want to ignore: cat file | grep -v "meh" > filteredFile. Here filteredFile will not include any line containing "meh". Grep is usually highly efficient and hence might give you much better performance than a similar method implemented in Python – dangerstat Feb 24 '10 at 21:22
  • @dangerstat - thanks, but i'm not deciding what line to remove based on matching a pattern. i already know the exact line number to remove. – AJ. Feb 24 '10 at 21:26
  • AJ: sed does exactly what you need. Look at the `d` command. – Mark Byers Feb 24 '10 at 21:27
  • Fastest would be to update the file in place, replacing the line with whitespace, is that acceptable? then `mmap` is the way to go – John La Rooy Feb 24 '10 at 22:08
  • Instead of repeating the process, is it possible to do it all in one pass? That should be a lot more efficient – John La Rooy Feb 24 '10 at 22:40
  • Instead of removing the line, create a new file for line numbers of deleted lines, and store the line number in this file. The next time you read the file, pretend that the deleted line isn't there. – Mark Byers Feb 24 '10 at 23:14

9 Answers

15

You can have two file objects for the same file at the same time (one for reading, one for writing):

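def removeLine(filename, lineno):
    # lineno is 0-based: the first line of the file is line 0
    fro = open(filename, "rb")

    # skip the lines that precede the one to remove
    current_line = 0
    while current_line < lineno:
        fro.readline()
        current_line += 1

    # open a second handle positioned at the start of the line to remove
    seekpoint = fro.tell()
    frw = open(filename, "r+b")
    frw.seek(seekpoint, 0)

    # read the line we want to discard
    fro.readline()

    # now move the rest of the lines in the file
    # one line back
    chars = fro.readline()
    while chars:
        frw.write(chars)
        chars = fro.readline()

    fro.close()
    # cut off the leftover bytes at the end of the file
    frw.truncate()
    frw.close()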
K. Brafford
  • What does truncate do without args? The python documentation isn't very clear. – James McMahon Nov 20 '11 at 06:12
  • @JamesMcMahon: What exactly is not clear about the docs? "Truncate the file's size. If the optional size argument is present, the file is truncated to (at most) that size. The size defaults to the current position." – László Papp Dec 03 '13 at 19:19
  • Although, I upvoted this question for giving some initial thought, I wrote an example with proper RAII ("with") usage with an additional variant for a search string. – László Papp Dec 04 '13 at 10:28
  • 2
    The line 'frw.writelines(chars)' should be 'frw.write(chars)' at least in Python3 – Michael SM Mar 06 '16 at 07:50
  • 1
    What prevents the writing object, frw, from conflicting with the reading object, fro? – Jo Bay Jul 30 '20 at 14:41
8

Modify the file in place: the offending line is overwritten with spaces, so the remainder of the file does not need to be shuffled around on disk. You can also "fix" the line in place, provided the fix is no longer than the line you are replacing.

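import os
from mmap import mmap

def removeLine(filename, lineno):
    # lineno is 1-based
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    # find the offset p where the target line starts
    p = 0
    for i in range(lineno - 1):
        p = m.find(b'\n', p) + 1
    # q is the offset of the newline that ends the line
    q = m.find(b'\n', p)
    if q == -1:  # last line has no trailing newline
        q = len(m)
    m[p:q] = b' ' * (q - p)
    m.close()
    os.close(f)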

If the other program can be changed to output the file offset instead of the line number, you can assign the offset to p directly and do without the for loop.
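
As a rough sketch of that offset-based variant (the function name `blank_line_at_offset` is hypothetical, and it assumes the checker reports the byte offset at which the bad line starts):

```python
import mmap

def blank_line_at_offset(filename, offset):
    """Overwrite the line starting at byte `offset` with spaces.

    Assumes `offset` points at the first byte of the line; the
    newline that ends the line is left untouched.
    """
    with open(filename, "r+b") as f:
        with mmap.mmap(f.fileno(), 0) as m:
            end = m.find(b"\n", offset)
            if end == -1:  # last line without a trailing newline
                end = len(m)
            if end > offset:
                m[offset:end] = b" " * (end - offset)
                m.flush()
```

Because nothing is scanned before `offset`, the cost no longer depends on where in the file the line sits.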

John La Rooy
  • 3
    A limitation here is that this won't work with a 32-bit Python build, due to mmap running out of address space at 4GB. See http://stackoverflow.com/questions/1661986/why-doesnt-pythons-mmap-work-with-large-files – Scott Griffiths Feb 25 '10 at 09:28
1

As far as I know, you can't just open a text file with Python and remove a line. You have to make a new file and copy everything but that line to it. If you know the specific line number, then you would do something like this:

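f = open('in.txt')
fo = open('out.txt','w')

# linenumtoremove is the 1-based number of the line to drop,
# set by the caller before this loop runs
ind = 1
for line in f:
    if ind != linenumtoremove:
        fo.write(line)
    ind += 1

f.close()
fo.close()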

You could, of course, check the contents of each line instead to decide whether to keep it. If you have a whole list of lines to remove or change, I also recommend making all those changes in a single pass through the file.
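
A minimal sketch of that single-pass idea, assuming the line numbers are 1-based and collected up front (the name `remove_lines` and the `.tmp` suffix are illustrative choices, not part of the answer above):

```python
import os

def remove_lines(filename, linenos):
    """Rewrite `filename` without the 1-based line numbers in `linenos`."""
    skip = set(linenos)           # set lookup keeps the per-line check cheap
    tmpname = filename + ".tmp"   # hypothetical temporary file name
    with open(filename, "rb") as src, open(tmpname, "wb") as dst:
        for n, line in enumerate(src, 1):
            if n not in skip:
                dst.write(line)
    # swap the filtered copy into place
    os.replace(tmpname, filename)
```

For example, `remove_lines("in.txt", [3, 17, 42])` drops all three lines while reading the file only once.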

Justin Peel
  • 6
    just a small comment, it is usually more convenient to use `enumerate()` in a for loop to count the iterations, as in: `for ind, line in enumerate(f):` – catchmeifyoutry Feb 24 '10 at 21:12
1

If the lines are variable length then I don't believe that there is a better algorithm than reading the file line by line and writing out all lines, except for the one(s) that you do not want.

You can identify these lines by checking some criteria, or by keeping a running tally of lines read and suppressing the writing of the line(s) that you do not want.

If the lines are fixed length and you want to delete specific line numbers, then you may be able to use seek to move the file pointer... I doubt you're that lucky though.
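
For illustration only, here is roughly what the fixed-length case could look like. It assumes every record is exactly `record_len` bytes including its newline; the function name and the `chunk_size` parameter are made up for this sketch. The start of the record is computed directly, so nothing before it has to be read, although the tail of the file still has to be shifted back.

```python
import os

def remove_fixed_record(filename, lineno, record_len, chunk_size=1 << 20):
    """Remove 1-based record `lineno` from a file of fixed-length records."""
    write_pos = (lineno - 1) * record_len   # start of the record to drop
    read_pos = write_pos + record_len       # start of the record after it
    if write_pos >= os.path.getsize(filename):
        raise IndexError("record number is past the end of the file")
    with open(filename, "r+b") as f:
        while True:
            # copy the tail of the file back by one record, chunk by chunk
            f.seek(read_pos)
            chunk = f.read(chunk_size)
            if not chunk:
                break
            f.seek(write_pos)
            f.write(chunk)
            read_pos += len(chunk)
            write_pos += len(chunk)
        f.truncate(write_pos)
```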

Dancrumb
  • @Dancrumb - thanks for the ideas. Unfortunately the lines/records are variable length. – AJ. Feb 24 '10 at 21:11
1

Update: solution using sed as requested by poster in comment.

To delete for example the second line of file:

sed '2d' input.txt

Use the -i switch to edit in place. Warning: this is a destructive operation. Read the help for this command for information on how to make a backup automatically.

Mark Byers
  • When Mark says destructive, he means it will really delete line 2 (2d meaning line 2, delete). You can use grep to find the line number and then delete that line with sed. For instance, say you want to delete the line that has the text 'danger will danger'. You can run dangerline=$(grep -n 'danger will danger' file | cut -d : -f 1) and then sed -i "$dangerline d" file. You may also need to 'cast' dangerline to an integer, which you can do by adding dangerline=$(($dangerline+0)) before the sed call. – user3622356 Feb 21 '22 at 22:27
0

I will provide two alternatives based on the look-up factor (line number or a search string):

Line number

def removeLine2(filename, lineNumber):
    # lineNumber is 0-based; binary mode keeps tell()/seek() offsets exact
    with open(filename, 'r+b') as outputFile:
        with open(filename, 'rb') as inputFile:

            currentLineNumber = 0
            while currentLineNumber < lineNumber:
                inputFile.readline()
                currentLineNumber += 1

            seekPosition = inputFile.tell()
            outputFile.seek(seekPosition, 0)

            # skip the line to remove
            inputFile.readline()

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.write(currentLine)
                currentLine = inputFile.readline()

        outputFile.truncate()

String

def removeLine(filename, key):
    # removes the first line that starts with the key in double quotes
    quotedKey = ('"%s"' % key).encode()  # assumes a UTF-8 compatible file
    with open(filename, 'r+b') as outputFile:
        with open(filename, 'rb') as inputFile:
            seekPosition = 0
            currentLine = inputFile.readline()
            # stop at end of file so a missing key cannot loop forever
            while currentLine and not currentLine.strip().startswith(quotedKey):
                seekPosition = inputFile.tell()
                currentLine = inputFile.readline()

            outputFile.seek(seekPosition, 0)

            currentLine = inputFile.readline()
            while currentLine:
                outputFile.write(currentLine)
                currentLine = inputFile.readline()

        outputFile.truncate()
László Papp
0
import os

def removeLine(filename, lineno):
    # lineno is 1-based; copy every other line to a new file, then swap it in
    src = open(filename)
    dst = open(filename + ".new", "w")
    for i, l in enumerate(src, 1):
        if i != lineno:
            dst.write(l)
    src.close()
    dst.close()
    os.rename(filename + ".new", filename)
Matt Joiner
0

I think there was a somewhat similar if not exactly the same type of question asked here. Reading (and writing) line by line is slow, but you can read a bigger chunk into memory at once, go through that line by line skipping lines you don't want, then writing this as a single chunk to a new file. Repeat until done. Finally replace the original file with the new file.

The thing to watch out for is when you read in a chunk, you need to deal with the last, potentially partial line you read, and prepend that into the next chunk you read.
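
One possible shape for this, as a sketch rather than a tuned implementation: the name `remove_line_chunked`, the 16 MB default chunk size and the `.new` suffix are all assumptions made for the example. Chunks that do not contain the target line are written out untouched; only the chunk that holds it is split into lines.

```python
import os

def remove_line_chunked(filename, lineno, chunk_size=16 * 1024 * 1024):
    """Copy `filename` without 1-based line `lineno`, reading in big chunks."""
    tmpname = filename + ".new"   # hypothetical temporary file name
    current = 1                   # number of the first line in the next piece
    carry = b""                   # partial last line from the previous chunk
    with open(filename, "rb") as src, open(tmpname, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            data = carry + chunk
            # keep only whole lines; hold back the trailing partial line
            cut = data.rfind(b"\n") + 1
            data, carry = data[:cut], data[cut:]
            n = data.count(b"\n")
            if current <= lineno < current + n:
                lines = data.split(b"\n")   # last item is b"" after final newline
                del lines[lineno - current]
                data = b"\n".join(lines)
            dst.write(data)
            current += n
        # whatever is left is a final line with no trailing newline
        if carry and current != lineno:
            dst.write(carry)
    os.replace(tmpname, filename)
```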

Heikki Toivonen
0

@OP, if you can use awk, eg assuming line number is 10

$ awk 'NR!=10' file > newfile
ghostdog74