I want to remove empty lines from a large text file with Python 3. I have a one-liner which works fine most of the time, but not for large files:
open('path_to/dest_file.csv', 'w').write(re.sub('\n\s*\n+', '\n', open('path_to/source_file.csv').read()))
Sometimes it results in a MemoryError:
Traceback (most recent call last):
  File "/scripts/dwh_common.py", line 261, in merge_files
    open(out_path, 'w').write(re.sub('\n\s*\n+', '\n', open(tmp_path).read()))
  File "/usr/lib64/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError
I am aware that I could use sed instead, but I want to avoid calling OS executables if possible. I am also aware that I could split the file before processing or increase the available memory, but I'm trying to avoid that as well. Does anyone have an idea for solving this in a more memory-efficient way in Python?
Edit: What makes this question different from others is that I'm not asking how to delete all blank lines in a file with Python, but how to do it in a memory-efficient way.
Answer: As pointed out by @not_a_robot and @bruno desthuilliers, reading line by line instead of reading the whole file into memory solved the issue. I used the answer from this question:
with open(tmp_path) as f, open(out_path, 'w') as outfile:
    # Iterate over the file object directly so only one line is held in
    # memory at a time (f.readlines() would load the whole file into a list).
    for line in f:
        # Skip lines that are empty or contain only whitespace.
        if not line.strip():
            continue
        outfile.write(line)
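For what it's worth, the same streaming idea can also be written with a generator expression. This is just a sketch of an equivalent variant (the path names are placeholders reused from the question), not the code from the linked answer:

tmp_path = 'path_to/source_file.csv'   # placeholder paths from the question
out_path = 'path_to/dest_file.csv'
with open(tmp_path) as f, open(out_path, 'w') as outfile:
    # writelines() consumes the generator lazily, so only one line
    # is held in memory at a time.
    outfile.writelines(line for line in f if line.strip())

Either way, the key point is that iterating over the file object streams the input line by line, so peak memory stays on the order of the longest line rather than the whole file.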