I am trying to remove the arrow characters, blank lines, and headers from this text file and write the result to a new file, MICnew.txt. My attempt is below, but it doesn't work: nothing changes in the new file. I have attached a sample file as well. Please help, and thanks so much.

import re
with open('MIC.txt') as oldfile, open('MICnew.txt', 'w') as newfile:
    for line in oldfile:
        newfile.write(re.sub(r'[^\x00-\x7f]',r' ',line))

with open('MICnew.txt','r+') as file:
    for line in file:
        if not line.isspace():
            file.write(line)
Shri

1 Answer

You can't safely interleave reading and writing on the same file handle like this. When you open a file with mode r+, the I/O pointer starts at the beginning, but reading pushes it to the end (as explained in this answer). So in your case, reading the first line of the file moves the pointer to the end of the file. You then write out that line (unless it is all whitespace), but crucially, the pointer stays at the end. On the next iteration of the loop you have already reached the end of the file, so your program stops.

To avoid this, read in all the contents of the file first, then loop over them and write out what you want:

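from pathlib import Path

# Read the whole file into memory before reopening it for writing
file_data = Path('MICnew.txt').read_text()

with open('MICnew.txt', 'w') as out_handle:  # THIS WILL OVERWRITE THE FILE!
    for line in file_data.splitlines():
        # splitlines() drops the line endings, so blank lines arrive as ''
        if line.strip():
            out_handle.write(line + '\n')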

But making two passes like that is a bit clumsy, and you can instead combine the two steps into one:

import re

with open('MIC.txt', errors='ignore') as oldfile, \
     open('MICnew.txt', 'w') as newfile:

    for line in oldfile:
        # Strip form feeds from the line ends, then blank out non-ASCII characters
        clean_line = re.sub(r'[^\x00-\x7f]', ' ', line.strip('\x0c'))
        if not clean_line.isspace():
            newfile.write(clean_line)

To drop bytes that cannot be decoded, the file is opened with errors='ignore', which silently omits improperly encoded characters. Since the sample file also contains a number of rogue form feed characters (ASCII code 12, or \x0c in hex), the code explicitly strips them from the start and end of each line.
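
To see what each of those steps does in isolation, here is a small sketch that runs them on a few in-memory lines instead of a file. The helper name clean_line and the sample byte strings are made up for illustration, and the sketch assumes the input is decoded as UTF-8 (open() without an explicit encoding uses the platform default, which may differ).

```python
import re

def clean_line(raw: bytes) -> str:
    """Hypothetical helper: apply the three cleaning steps to one raw line."""
    # errors='ignore' silently drops bytes that are not valid UTF-8 (assumed encoding)
    text = raw.decode('utf-8', errors='ignore')
    # strip() with an argument only removes form feeds at the ends of the line
    text = text.strip('\x0c')
    # replace any remaining non-ASCII character (such as an arrow) with a space
    return re.sub(r'[^\x00-\x7f]', ' ', text)

# Made-up sample lines, not taken from the real MIC.txt
samples = [
    b'\x0cHEADER\n',                  # leading form feed
    'a \u2192 b\n'.encode('utf-8'),   # correctly encoded arrow character
    b'caf\xe9\n',                     # stray byte that is not valid UTF-8
    b' \n',                           # whitespace-only line
]

cleaned = [clean_line(s) for s in samples]
# Same filter as in the loop above: skip lines that are only whitespace
kept = [line for line in cleaned if not line.isspace()]
print(kept)
```

Note that the arrow is replaced by a space while the undecodable byte simply disappears, so the two mechanisms are not interchangeable.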

Jan Wilamowski
  • Thanks Jan. I tried the second part of your answer (the merged loops), but it did not remove the special characters or blank lines. Could you try your code on the sample file I attached to the question (hyperlink)? Thanks – Shri Nov 18 '21 at 00:18
  • @CodeBot I updated my answer and tested against your sample file. Since it has a number of formfeed characters, I remove them explicitly. – Jan Wilamowski Nov 19 '21 at 03:27
  • Superb! This works now. Thanks @Jan, I really appreciate it. I have marked it as the answer. – Shri Nov 19 '21 at 10:26