0

This code works perfectly; the only problem is that it doesn't work with large text files (around 1 GB). What can I do to fix it?

import os

file_1 = open('file1.txt', 'r', encoding='utf8').read().splitlines()
file_2 = open('file2.txt', 'r', encoding='utf8').read().splitlines()
[file_2.remove(l) for l in file_1 if l in file_2]
with open('file2.txt', 'w') as new_file:
    [new_file.write(l + '\n') for l in file_2]
Stew
  • 41
  • 3
  • 1
    Can you define "doesn't work with large files"? Do you mean you get a memory error (since you read both files into memory), or that it's just really slow (since you're calling the `remove()` method of a list)? Are these files already sorted before you process them in this script? – Chris Doyle Nov 03 '19 at 11:30
  • That just takes some computing time, which is natural, but there are ways to reduce it. – Hisham___Pak Nov 03 '19 at 11:31
  • @ChrisDoyle (result, consumed) = self._buffer_decode(data, self.errors, final) MemoryError – Stew Nov 03 '19 at 11:34
  • 1
    Yeah, so this happens because you try to read the entire file into memory. For files larger than your available memory this is impossible, as you won't have enough memory to hold the entire contents. Instead you need to rethink how to approach this problem in a way that lets you read the file either line by line or in chunks. – Chris Doyle Nov 03 '19 at 11:36
  • @ChrisDoyle so what's the solution? – Stew Nov 03 '19 at 11:39
  • Well, as we don't know your aim / objective, it's hard to say. As I already asked, are both of these files sorted? It seems your aim is to remove the lines from file2 that exist in file1, so that file2 contains only lines that don't exist in file1. You don't mention whether these files are sorted / ordered, and that changes the complexity and runtime of such a problem. – Chris Doyle Nov 03 '19 at 11:40
  • You could have a look at https://stackoverflow.com/a/57287702/1212401 as it's a similar question. – Chris Doyle Nov 03 '19 at 11:45
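
To illustrate why sortedness matters in the comments above: if both files were already sorted in the same order, the difference could be computed in a single pass over each file with constant memory. This is only a sketch under that assumption (the question never confirms the files are sorted); the names `sorted_difference`, `strip_newlines` and `filter_sorted_files` are made up for illustration, and "sorted" here means Python's default string ordering of the lines without their trailing newline.

```python
def sorted_difference(keep_lines, drop_lines):
    """Yield items of keep_lines that do not occur in drop_lines.

    Assumption: both iterables are sorted ascending in the same order.
    """
    drop_iter = iter(drop_lines)
    current = next(drop_iter, None)
    for item in keep_lines:
        # Advance the second stream until it catches up with `item`.
        while current is not None and current < item:
            current = next(drop_iter, None)
        if current is None or item != current:
            yield item


def strip_newlines(handle):
    """Yield each line of an open text file without its trailing newline."""
    for line in handle:
        yield line.rstrip("\n")


def filter_sorted_files(keep_path, drop_path, out_path, encoding="utf8"):
    """Write the lines of keep_path that are absent from drop_path to out_path."""
    with open(keep_path, "r", encoding=encoding) as keep_file, \
         open(drop_path, "r", encoding=encoding) as drop_file, \
         open(out_path, "w", encoding=encoding) as out_file:
        kept = sorted_difference(strip_newlines(keep_file),
                                 strip_newlines(drop_file))
        for line in kept:
            out_file.write(line + "\n")
```

Only one line from each file is held in memory at a time. If the files are not sorted, this produces wrong results, so it is not a drop-in replacement for the code in the question.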

2 Answers

1

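You need to read the files without holding their whole content in memory. You can do that by opening the input files with `with` and iterating over them line by line, writing the result to a new file. Note that this rereads the first file for every line of the second, so it needs very little memory but can be slow on large inputs: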

with open(r'C:\Users\Guy.SA\Desktop\fileB.txt', 'r') as file_2, open(r'C:\Users\Guy.SA\Desktop\fileC.txt', 'w') as new_file:
    for line_2 in file_2:
        with open(r'C:\Users\Guy.SA\Desktop\fileA.txt', 'r') as file_1:
            for line_1 in file_1:
                if line_1 == line_2:
                    break
            else:
                new_file.write(line_2)
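
If the distinct lines of the first file fit in memory (an assumption, the question does not say how large each file is), a faster variant is to load only that file into a `set` and stream the second one. The sketch below is not part of the original answer; `remove_common_lines` is a hypothetical helper name. Like the loop above, it drops every occurrence of a matching line, which differs slightly from the `list.remove()` behaviour in the question.

```python
import os
import tempfile


def remove_common_lines(keep_path, drop_path, encoding="utf8"):
    """Rewrite keep_path without any line that also appears in drop_path."""
    # Only the distinct lines of drop_path are held in memory.
    with open(drop_path, "r", encoding=encoding) as drop_file:
        unwanted = {line.rstrip("\n") for line in drop_file}

    # Stream keep_path into a temporary file next to it, then swap the two,
    # so the original is never truncated while it is still being read.
    directory = os.path.dirname(os.path.abspath(keep_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encoding) as out_file, \
             open(keep_path, "r", encoding=encoding) as keep_file:
            for line in keep_file:
                if line.rstrip("\n") not in unwanted:
                    out_file.write(line)
        os.replace(tmp_path, keep_path)
    except BaseException:
        # Clean up the partial temporary file before re-raising.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

For the files in the question this would be called as `remove_common_lines('file2.txt', 'file1.txt')`. Set lookups take roughly constant time, so the second file is read only once instead of rescanning the first file for every line.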
Guy
  • 46,488
  • 10
  • 44
  • 88
0

You should iterate over the file objects directly for this, instead of calling `read()` on them:

with open('file1.txt', 'r', encoding='utf8') as file_1, \
     open('file2.txt', 'r', encoding='utf8') as file_2:

    for line in file_1:  # or file_2
        # Do what you need to do here, reading the file line by line
        pass

Also note that `with` closes the files automatically when the block ends.
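
As a small self-contained illustration of both points (line-by-line iteration and automatic closing), here is a sketch that uses a throwaway temporary file instead of the real `file1.txt`; the variable names are only for demonstration.

```python
import os
import tempfile

# Create a small throwaway file so the sketch runs on its own.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf8") as handle:
    handle.write("first\nsecond\n")

line_count = 0
with open(path, "r", encoding="utf8") as file_1:
    for line in file_1:            # only one line is in memory at a time
        line_count += 1
    closed_inside = file_1.closed  # False: the file is still open here

closed_after = file_1.closed       # True: closed when the block exited
os.remove(path)
```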

Mehrdad Pedramfar
  • 10,941
  • 4
  • 38
  • 59