I have a 30 GB text file and I want to filter some data:

output = open('output.txt', 'a')

with open('30gbfile.txt') as f:
    for line in f:
        if 'xxx' in line:
            output.write(line)

output.close()

No error occurs, but execution stops after a few seconds and output.txt contains only 3% of the data that should have been extracted.

user2490424
  • Off-topic, but are you sure you don't need a database? A 30 GB file that you filter doesn't sound maintainable... – Sayse Feb 28 '17 at 09:18
  • This is I/O bound so there isn't much we can do. – Alex Fung Feb 28 '17 at 09:19
  • Your code should work, but slightly more pythonic would be `with open ('30gbfile.txt') as in_f, open('output.txt','a') as out_f:` – Chris_Rands Feb 28 '17 at 09:20
  • I cannot imagine how or why such a simple script could cause problems. If there are no error messages, it is likely that it has actually copied only the lines where the pattern was found. To be sure, you should store the other lines in a third file: `output_else = open('else.txt', 'w') ... if ... else: output_else.write(line)` – Serge Ballesta Feb 28 '17 at 09:22
  • I found this great blog post on processing huge files using multiprocessing: http://www.blopig.com/blog/2016/08/processing-large-files-using-python/ Hope it helps. – Kruupös Feb 28 '17 at 09:31
  • Just to be able to assert that you can read the whole file, initialize a counter: `n=0`, loop over the lines incrementing the counter: `for line in f: n += 1` and eventually `print(n)` – gboffi Feb 28 '17 at 09:56
  • I don't understand why it works correctly for this guy: https://youtu.be/i2DHWxtRqpE?t=944 – user2490424 Feb 28 '17 at 10:03
  • Could you please edit your question to explain on what basis you assert that only 3% of the expected output has been produced? – gboffi Feb 28 '17 at 10:06
  • When you say "the execution stops after few seconds" do you mean that the program terminates normally and returns you to the command prompt? Or does it just appear to hang? – PM 2Ring Feb 28 '17 at 10:20
  • Actually, only 0.0089% of the expected output has been produced, because 'xxx' appears in at least 1 million lines of 350 and the output file contains only 89 lines. It says "finished 14s." – user2490424 Feb 28 '17 at 10:22
  • http://stackoverflow.com/questions/28643919/python-string-processing-optimization/28644327 – Antti Haapala -- Слава Україні Feb 28 '17 at 10:36
  • If you have a 64-bit machine you should definitely try Antti's suggestion of `mmap`. Also note that `mmap` objects have a [readline](https://docs.python.org/3/library/mmap.html#mmap.mmap.readline) method. – PM 2Ring Feb 28 '17 at 11:38
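
The checks suggested in the comments above (a single `with` for both files, a line counter, and a way to see how much of the input was really read) can be combined into one small diagnostic. This is only a sketch: the function name `filter_and_count` is hypothetical, and opening the files in binary mode is an assumption made here so that no decoding or text-mode end-of-file handling can cut the read short. The thread does not confirm that this is the cause of the problem.

```python
def filter_and_count(src_path, dst_path, needle):
    """Append lines containing `needle` (bytes) to dst_path.

    Returns (lines_read, lines_matched, bytes_read) so the caller can
    check whether the whole input file was actually consumed.
    """
    lines_read = lines_matched = bytes_read = 0
    # Binary mode is an assumption of this sketch: lines are bytes objects,
    # so `needle` must be bytes too (for example b'xxx').
    with open(src_path, 'rb') as in_f, open(dst_path, 'ab') as out_f:
        for line in in_f:
            lines_read += 1
            bytes_read += len(line)
            if needle in line:
                lines_matched += 1
                out_f.write(line)
    return lines_read, lines_matched, bytes_read
```

Calling `filter_and_count('30gbfile.txt', 'output.txt', b'xxx')` and comparing the returned byte count with `os.path.getsize('30gbfile.txt')` would show whether the loop reached the end of the file or stopped early.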
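
For the `mmap` suggestion, a minimal sketch using the `readline` method mentioned above might look like the following. The function name `filter_with_mmap` is hypothetical, and the code assumes a non-empty input file and a platform where the whole file can be mapped (hence the 64-bit remark). It is not claimed to be faster than plain iteration, only a different way to read the same bytes.

```python
import mmap


def filter_with_mmap(src_path, dst_path, needle):
    """Append lines containing `needle` (bytes) to dst_path via a memory map.

    Returns the number of matching lines written.
    """
    matched = 0
    with open(src_path, 'rb') as in_f, open(dst_path, 'ab') as out_f:
        # Length 0 maps the whole file; mmap raises ValueError for an
        # empty file, which this sketch does not handle.
        mm = mmap.mmap(in_f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # readline() returns b'' once the end of the map is reached.
            for line in iter(mm.readline, b''):
                if needle in line:
                    out_f.write(line)
                    matched += 1
        finally:
            mm.close()
    return matched
```

As with the question's code, the output file is opened in append mode, so running it twice writes the matching lines twice.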
