I'm trying to multiprocess one of my scripts. The script reads all the data from a file and writes the relevant lines (for example, all students whose first name is Jacob) to another file.

The original script was:

search_val = "jacob"
with open("big_file.txt") as f:
    with open("matches.txt", "w") as f_out:
        for line in f:
            if (search_val in line.lower()):
                f_out.write(line)

Both this script and the multiprocess version generate good results for small files, but only the original script also works on big files.

The multiprocess script is:

from multiprocessing import Pool
import threading
import time
import Queue
import sys

search_val = {"key1" : [], "key2" : [], "key3":[]}

def process_line(line):
    global search_val
    key_val_list = []
    for key in search_val.keys():
        if (key.lower() in line.lower()):
            search_val[key].append(line.strip())
            key_val_list.append({key:line})
    return key_val_list

#with open("big_file.txt") as f:
def get_lines():
    with open("small_file.txt") as f:
        yield f


if __name__ == "__main__":
    pool = Pool(8)
    file_lines = get_lines()
    start = time.time()
    end = time.time()
    #print(end - start)
    results = pool.map(process_line, next(file_lines), 8)
    #pool.close()
    #print(results)

    print("Done reading")
    end = time.time()
    print(end - start)

    with open("results.txt", "w") as f_out:
        f_out.write(str(results))

    print("Done saving results")
    end = time.time()
    print(end - start)

    print_dict = {}
    for line in results:
        for result in line:
            for key in result.keys():
                if key in print_dict.keys():
                    print_dict[key].append(result[key].strip())
                else:
                    print_dict[key] = [result[key].strip()]

    print("Done ordering")
    end = time.time()
    print(end - start)

    for key in print_dict.keys():
        with open(key+".txt", "w") as f_out:
            for val in print_dict[key]:
                f_out.write(val + "\n")

You can use this as the small file:

key2@key1.co.il
key3@key1.net
key4@key1.co.uk
key5@key1.co.il
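
For reference, working through that input by hand, the expected output is key1.txt containing all four lines, key2.txt containing only key2@key1.co.il, and key3.txt containing only key3@key1.net.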

This script works fine for small files, but it doesn't generate any results (it doesn't even print "Done reading") for the big_file. The big_file size is 11 GB.

I have 2 questions:

  1. Did I use yield the way I'm supposed to?
  2. Do you have any idea why it doesn't work?

I also tried to update the search_val map (which is a global variable), but that didn't work either, so I tried the list option.
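
For reference, that earlier attempt looked roughly like this (the same matching logic, with the workers expected to fill the global dict directly instead of returning anything):

def process_line(line):
    global search_val
    for key in search_val.keys():
        if (key.lower() in line.lower()):
            search_val[key].append(line.strip())
    # nothing is returned; the parent was expected to read search_val afterwards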

If you have any ideas you are more than welcome to share.

N0nam3
  • Do you have 11GB free RAM? Append doesn't work the way you think and is not efficient. The best solution here would be a stream with a not-too-big buffer size. Read an example of a stream read here: https://stackoverflow.com/questions/26127889/python-read-stream (a sketch of this streaming approach follows these comments). – Zydnar Nov 24 '18 at 13:36
  • The way you're using `yield` looks all right, so it's not clear why it wouldn't work for large files. Are you sure you're waiting long enough for it to finish? It would take a very long time to process a file that large. I suggest you try it on something smaller but still fairly big (like "only" 1 GB) and see if the same problem exists. The issue might simply be slow performance. There are several ways the code could be improved to speed things up, but first you need to establish what the problem really is. – martineau Nov 24 '18 at 13:37
  • Thanks for your responses. I don't need 11 GB of RAM since the output file is very small, and I don't read all the data at once. I'm pretty sure I waited long enough, because the non-multiprocess script finishes in 11-12 minutes, while this script ran for about 4 hours and didn't end. So I wonder if I'm not using multiprocessing as I should? – N0nam3 Nov 24 '18 at 13:44
  • No you certainly don't need that much RAM and the way you're using `multiprocessing` seems proper, as well. Note that even with a `chunksize` of 8, there's still a tremendous number of processes being created and result lists being retrieved. Using `multiprocessing` often introduces an awful lot of overhead... – martineau Nov 24 '18 at 14:06
  • Thanks. Can you recommend a different approach to achieve my goal? My goal was to make this script run faster (and currently I'm failing :)). Thanks – N0nam3 Nov 24 '18 at 14:14
  • By the way... 1 GB took 2 minutes. Currently waiting for a 4 GB file to finish (it has been running for more than 7 minutes already). – N0nam3 Nov 24 '18 at 14:17
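
A minimal sketch of the streaming approach the comments point toward: pool.imap instead of pool.map, a much larger chunksize, and writing matches as they arrive instead of collecting everything first. The key list, file names, and chunksize value here are assumptions, not anything from the original script.

from multiprocessing import Pool

search_keys = ["key1", "key2", "key3"]

def process_line(line):
    # Return the matching keys together with the line; the parent does all the writing.
    matches = [key for key in search_keys if key.lower() in line.lower()]
    return matches, line

if __name__ == "__main__":
    pool = Pool(8)
    out_files = {key: open(key + ".txt", "w") for key in search_keys}
    with open("big_file.txt") as f:
        # imap streams results back lazily, so neither the input lines nor
        # the full result list has to sit in memory at once.
        for matches, line in pool.imap(process_line, f, chunksize=10000):
            for key in matches:
                out_files[key].write(line)
    pool.close()
    pool.join()
    for f_out in out_files.values():
        f_out.close()

Whether this beats the plain single-process loop depends on how expensive process_line is; for a simple substring test the single-process version may still win, as the comments about multiprocessing overhead suggest.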

0 Answers