Using Python to remove paratext (or 'noise') from txt files

Question

I am in the process of preparing a corpus of text files consisting of 170 Dutch novels. I am a literary scholar and relatively new to Python, and to programming in general. I am trying to write a Python script that removes everything from each .txt file that does NOT belong to the actual content of the novel (i.e. the story). Things I want to remove include added author biographies, blurbs, and other pieces of information that come with converting an ePub to .txt.

My idea is to decide manually, for every .txt file, at which line the actual content of the novel begins and where it ends. I am using the following block of code to remove all information in the .txt file that is not contained between those two line numbers:

def removeparatext(inputFilename, outputFilename):
    inputfile = open(inputFilename,'rt', encoding='utf-8')
    outputfile = open(outputFilename, 'w', encoding='utf-8')

    for line_number, line in enumerate(inputfile, 1):
        if line_number >= 80 and line_number <= 2741: 
            outputfile.write(inputfile.readline())

    inputfile.close()
    outputfile.close()

removeparatext(inputFilename, outputFilename)

The numbers 80 and 2741 are the start and end line numbers of the actual content of one specific novel. However, the script produces a .txt file with only the text BEFORE line number 80 removed; it still contains everything AFTER line number 2741. I do not understand why. Perhaps I am not using the enumerate() function correctly.

Another thing is that I would like to get rid of all unnecessary spaces in the .txt file. But the .strip() method does not seem to work when I use it in this block of code.

Could anyone give me a suggestion as to how to solve this problem? Many thanks in advance!

score 1 · Accepted Answer · edited May 23 '17 at 12:06

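enumerate already provides each line alongside its index, so you don't need to call readline on the file object again. Doing so reads the file at double pace: inside the range, each pass through the loop consumes two lines (one by the loop itself, one by readline), so line_number falls behind the real position in the file and text beyond the intended end gets written. Write the line that enumerate gives you instead: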

for line_number, line in enumerate(inputfile, 1):
    if line_number >= 80 and line_number <= 2741: 
        outputfile.write(line)
#                        ^^^^
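
The double-pace effect can be reproduced on a small in-memory sample. The following is a minimal sketch, assuming Python 3; the ten-line sample text and the function names buggy_extract and fixed_extract are made up for illustration, and io.StringIO stands in for the opened file:

```python
import io

# Ten numbered lines stand in for a novel's text file (hypothetical data).
sample = "".join(f"line {n}\n" for n in range(1, 11))

def buggy_extract(text, start, end):
    """Mimics the original loop: readline() is called inside enumerate."""
    source = io.StringIO(text)
    kept = []
    for line_number, line in enumerate(source, 1):
        if start <= line_number <= end:
            # readline() consumes one more line, so `line` itself is lost
            kept.append(source.readline())
    return kept

def fixed_extract(text, start, end):
    """Uses the line that enumerate already provides."""
    source = io.StringIO(text)
    return [line for line_number, line in enumerate(source, 1)
            if start <= line_number <= end]

buggy = buggy_extract(sample, 3, 5)
fixed = fixed_extract(sample, 3, 5)
print(buggy)  # ['line 4\n', 'line 6\n', 'line 8\n']
print(fixed)  # ['line 3\n', 'line 4\n', 'line 5\n']
```

The buggy version skips every other line and keeps going past the intended end (line 8 here, although the range stops at 5), which matches the symptom described in the question.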

As an alternative to using enumerate and iterating through the entire file, you may consider slicing the file object using itertools.islice, which takes a zero-based start index and an exclusive stop index, and then writing the sliced sequence to the output file using writelines:

from itertools import islice

def removeparatext(inputFilename, outputFilename):
    inputfile = open(inputFilename,'rt', encoding='utf-8')
    outputfile = open(outputFilename, 'w', encoding='utf-8')

    # use writelines to write sliced sequence of lines 
    outputfile.writelines(islice(inputfile, 79, 2741)) # indices start from zero

    inputfile.close()
    outputfile.close()
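
To see how the 1-based line numbers 80 and 2741 map onto the islice arguments 79 and 2741, here is a small sketch. The helper name keep_lines and the sample list are hypothetical; only itertools.islice comes from the answer above:

```python
from itertools import islice

def keep_lines(lines, first, last):
    """Return lines first..last, counted from 1 and inclusive on both ends."""
    # islice counts from zero and excludes the stop index, so only the
    # start has to be shifted down by one.
    return list(islice(lines, first - 1, last))

numbered = [f"line {n}\n" for n in range(1, 11)]
selection = keep_lines(numbered, 3, 5)
print(selection)  # ['line 3\n', 'line 4\n', 'line 5\n']

# With the numbers from the question, 2741 - 80 + 1 = 2662 lines are kept.
novel_line_count = len(keep_lines(range(5000), 80, 2741))
print(novel_line_count)  # 2662
```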

In addition, you can open files and leave the closing/cleanup to Python by using a context manager with the with statement. See How to open a file using the open with statement.

from itertools import islice

def removeparatext(inputFilename, outputFilename):
    with open(inputFilename,'rt', encoding='utf-8') as inputfile,\
         open(outputFilename, 'w', encoding='utf-8') as outputfile:    
        # use writelines to write sliced sequence of lines 
        outputfile.writelines(islice(inputfile, 79, 2741))


removeparatext(inputFilename, outputFilename)
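
The question also asks about removing unnecessary spaces, which is not covered above. One possible approach is sketched below, assuming "unnecessary" means leading and trailing whitespace plus runs of repeated spaces inside a line. The names clean_line and remove_paratext_and_spaces, and the first/last parameters, are made up for this example. Note that str.strip() also removes the trailing newline, so it has to be added back when writing:

```python
from itertools import islice

def clean_line(line):
    """Collapse whitespace runs to single spaces and trim both ends."""
    # split() without arguments splits on any run of whitespace and drops
    # empty pieces; an empty line stays an empty line.
    return " ".join(line.split()) + "\n"

def remove_paratext_and_spaces(inputFilename, outputFilename, first, last):
    """Keep lines first..last (1-based, inclusive) and tidy their spacing."""
    with open(inputFilename, 'rt', encoding='utf-8') as inputfile, \
         open(outputFilename, 'w', encoding='utf-8') as outputfile:
        for line in islice(inputfile, first - 1, last):
            outputfile.write(clean_line(line))
```

For the novel from the question this would be called as remove_paratext_and_spaces(inputFilename, outputFilename, 80, 2741).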

Thank you so much! Using itertools.islice works just fine for me. I was already aware of using the with statement for opening files, but I was not sure how to use it while opening two files instead of one. — roelmetgevoel, Oct 14 '16 at 09:00
