
I have several files to iterate through, some of them several million lines long; a single file can be more than 500 MB. I need to prep them by replacing every occurrence of the string '| |' with '|'.

However, the following code runs into a MemoryError. How can I rework it to search and replace the files line by line and save RAM? Any ideas? This is not simply about reading a large file line by line, but about doing the replacement line by line while avoiding the problem of converting the content between a list and a string and back.

import os
didi = self.lineEdit.text()  # folder path taken from a Qt line edit
for filename in os.listdir(didi):
    if filename.endswith(".txt"):
        filepath = os.path.join(didi, filename)
        with open(filepath, errors='ignore') as file:
            s = file.read()  # reads the entire file into memory at once
            s = s.replace('| |', '|')
        with open(filepath, "w") as file:
            file.write(s)
Kokokoko
  • Process the file in chunks. Open an input and output file, and read only some N number of characters at a time (using the optional argument to `read()`). Since you are looking for a particular pattern, you may also need some additional logic to handle the boundaries between reads (see the sketch after these comments). – Iguananaut Oct 08 '19 at 12:17
  • Possible duplicate of [How to read a large file line by line](https://stackoverflow.com/questions/8009882/how-to-read-a-large-file-line-by-line) – stovfl Oct 08 '19 at 12:17
  • You might also save yourself some trouble by using an existing program dedicated to this task like `sed`. It will be faster and less error-prone, most likely. – Iguananaut Oct 08 '19 at 12:18
  • I would also make sure you are using the 64bit version of Python: https://stackoverflow.com/questions/1405913/how-do-i-determine-if-my-python-shell-is-executing-in-32bit-or-64bit-mode-on-os There is some kind of lower memory cap if you are using 32bit. – sniperd Oct 08 '19 at 12:41
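Here is a minimal sketch of the chunked approach suggested in the first comment (not code from the thread; the file names and the 1 MiB chunk size are placeholder assumptions). Because the pattern '| |' is three characters long, the last two characters of each processed chunk are carried over to the next read, so a match that straddles a read boundary is still caught:

PATTERN = '| |'
REPLACEMENT = '|'
CHUNK_CHARS = 1024 * 1024   # characters per read; placeholder size
KEEP = len(PATTERN) - 1     # tail that could begin a match spanning two reads

with open('input.txt', errors='ignore') as fin, \
        open('output.txt', 'w') as fout:
    carry = ''
    while True:
        chunk = fin.read(CHUNK_CHARS)
        if not chunk:            # read() returns '' at end of file
            break
        text = (carry + chunk).replace(PATTERN, REPLACEMENT)
        carry = text[-KEEP:]     # hold back the tail for the next round
        fout.write(text[:-KEEP])
    fout.write(carry)            # flush the held-back tail at the end

The sed suggestion amounts to a one-liner along the lines of `sed -i 's/| |/|/g' *.txt`, which streams each file instead of loading it whole.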

2 Answers


Try the following code:

chunk_size = 5000  # max number of lines held in memory before a write
buffer = ""
i = 0

with open(fileoutpath, 'a') as fout:
    with open(fileinpath, 'r') as fin:
        for line in fin:
            buffer += line.replace('| |', '|')
            i += 1
            if i == chunk_size:
                fout.write(buffer)  # flush the accumulated lines to disk
                i = 0
                buffer = ""
    if buffer:
        fout.write(buffer)  # write whatever is left after the loop

This code reads one line at a time into memory.

It accumulates the processed lines in a buffer that holds at most chunk_size lines; once the buffer is full, it is written to the output file and cleared, and this repeats until the end of the input. After the reading loop, any lines still in the buffer are written to disk.

This way you control not only how many lines sit in memory but also how many disk writes happen: writing to the file on every single line is inefficient, while an overly large chunk_size brings back the memory problem. It's up to you to find a chunk_size value that fits your problem.

Note: you can achieve a similar effect with the buffering parameter of open(); see the Python documentation for details. The logic is much the same.
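For example, a minimal sketch of that variant (the paths are placeholders): a large buffering value lets the io layer batch the disk writes while the loop stays a plain line-by-line pass:

with open('input.txt', errors='ignore') as fin, \
        open('output.txt', 'w', buffering=1024 * 1024) as fout:  # ~1 MiB write buffer
    for line in fin:
        fout.write(line.replace('| |', '|'))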

Massifox

Try reading the file line by line, instead of as one giant chunk. I.e.:

with open(writefilepath, "w", errors='ignore') as filew:
    with open(readfilepath, "r", errors='ignore') as filer:
        for cnt, line in enumerate(filer, 1):  # cnt is the line counter
            # print("Line {}: {}".format(cnt, line.strip()))  # optional progress output; slow for millions of lines
            line = line.replace('| |', '|')
            filew.write(line)
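To mirror the in-place behaviour of the question's directory loop, one common pattern is to write to a temporary file and then swap it over the original with os.replace() once the pass succeeds. A sketch under those assumptions (the directory path and the .tmp suffix are placeholders):

import os

didi = '/path/to/folder'  # placeholder; the question reads this from self.lineEdit.text()
for filename in os.listdir(didi):
    if filename.endswith('.txt'):
        filepath = os.path.join(didi, filename)
        tmppath = filepath + '.tmp'  # hypothetical temp-file name
        with open(filepath, errors='ignore') as filer, \
                open(tmppath, 'w') as filew:
            for line in filer:
                filew.write(line.replace('| |', '|'))
        os.replace(tmppath, filepath)  # swap the processed file over the original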
RightmireM