I am in the process of preparing a corpus of text files consisting of 170 Dutch novels. I am a literary scholar and relatively new to Python, and to programming in general. I am trying to write a Python script that removes everything from each .txt file that does NOT belong to the actual content of the novel (i.e. the story). Things I want to remove are: added biographies of the author, blurbs, and other pieces of information that come with converting an ePub to .txt.
My idea is to decide manually, for every .txt file, at which line the actual content of the novel begins and where it ends. I am using the following block of code to remove all text in the .txt file that is not between those two line numbers:
def removeparatext(inputFilename, outputFilename):
    inputfile = open(inputFilename,'rt', encoding='utf-8')
    outputfile = open(outputFilename, 'w', encoding='utf-8')
    for line_number, line in enumerate(inputfile, 1):
        if line_number >= 80 and line_number <= 2741:
            outputfile.write(inputfile.readline())
    inputfile.close()
    outputfile.close()

removeparatext(inputFilename, outputFilename)
The numbers 80 and 2741 are the first and last line numbers of the actual content of one specific novel. However, the output file only has the text BEFORE line 80 removed; it still contains everything AFTER line 2741. I do not understand why. Perhaps I am not using the enumerate() function correctly.
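A likely cause, as far as can be told from the code, is the call to inputfile.readline() inside the loop. The for loop has already read the current line into the variable line, so readline() fetches the line after it and moves the file position forward a second time. Each write therefore consumes two lines while line_number only goes up by one, so the counter lags behind the real position in the file and may not reach 2741 before the file ends. The sketch below writes line itself instead. The function name remove_paratext and the parameters first_line and last_line are hypothetical names chosen for illustration, not part of the original script.

```python
def remove_paratext(input_filename, output_filename, first_line, last_line):
    """Copy lines first_line..last_line (1-based, inclusive) to a new file."""
    # "with" closes both files automatically, even if an error occurs.
    with open(input_filename, "rt", encoding="utf-8") as infile, \
            open(output_filename, "w", encoding="utf-8") as outfile:
        for line_number, line in enumerate(infile, 1):
            if line_number > last_line:
                break  # nothing after the last content line is needed
            if line_number >= first_line:
                # Write the line the loop already read; do not call
                # readline() here, as that would skip ahead in the file.
                outfile.write(line)
```

For the novel in question the call would be remove_paratext(inputFilename, outputFilename, 80, 2741). Passing the line numbers as arguments means the same function can be reused for all 170 files.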
I would also like to get rid of all unnecessary whitespace in the .txt file, but the .strip() method does not seem to work when I add it to this block of code.
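Two things commonly go wrong with .strip(), and either may apply here. First, strings are immutable, so line.strip() returns a new string and leaves line unchanged unless the result is used. Second, .strip() also removes the newline at the end of the line, so it has to be added back before writing. The sketch below is one possible approach, assuming that "unnecessary spaces" means leading and trailing whitespace plus runs of spaces or tabs inside a line; the helper name normalize_whitespace is hypothetical.

```python
def normalize_whitespace(line):
    """Trim a line and collapse internal runs of whitespace to one space."""
    # split() with no arguments splits on any whitespace and drops
    # leading and trailing whitespace, including the newline.
    words = line.split()
    # Re-join with single spaces and restore the newline that was removed.
    return " ".join(words) + "\n"


print(repr(normalize_whitespace("  Het   was \t laat. \n")))
# → 'Het was laat.\n'
```

Inside the loop this would be used as outputfile.write(normalize_whitespace(line)). A blank line comes out as a single newline, so paragraph breaks are preserved.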
Could anyone give me a suggestion as to how to solve this problem? Many thanks in advance!