
I am trying to process a large number of files using Python, but because the file count is huge it is taking too much time. I want to create multiple threads and process the files in parallel to cut down the time, but I am not sure exactly how to do it.

I have written the following code, which is supposed to process 10 files in parallel, but it seems that rather than creating 10 threads it creates 100 threads, one for each file.

    import logging
    import os
    import threading


    def setup_logging():
        log_formatter = logging.Formatter('%(asctime)s [%(threadName)s] [%(levelname)s] %(message)s')
        root_logger = logging.getLogger()

        file_handler = logging.FileHandler("./logs.log")
        file_handler.setFormatter(log_formatter)
        root_logger.addHandler(file_handler)

        console_handler = logging.StreamHandler()
        console_handler.setFormatter(log_formatter)
        root_logger.addHandler(console_handler)
        root_logger.setLevel(logging.DEBUG)


    def print_file_name(name):
        logging.info(name)


    if __name__ == '__main__':
        setup_logging()
        logging.info("hi")

        dir_name = "/home/egnyte/demo/100"
        file_list = os.listdir(dir_name)
        threads = []
        for i in range(0, len(file_list), 10):
            for index in range(0, 10, 1):
                t = threading.Thread(target=print_file_name, args=(file_list[i+index],))
                threads.append(t)
                t.start()

            for t in threads:
                t.join()
Now the problem is that in the logs I can see the following lines, which make me think it is creating more than 10 threads, in fact one for every file, and that is not what I want.

2017-03-30 13:16:46,120 [Thread-9] [INFO] demo_69.txt
2017-03-30 13:16:46,120 [Thread-10] [INFO] demo_45.txt
2017-03-30 13:16:46,121 [Thread-11] [INFO] demo_72.txt
2017-03-30 13:16:46,121 [Thread-12] [INFO] demo_10.txt
...
...
2017-03-30 13:16:46,149 [Thread-98] [INFO] demo_29.txt
2017-03-30 13:16:46,150 [Thread-99] [INFO] demo_27.txt
2017-03-30 13:16:46,150 [Thread-100] [INFO] demo_39.txt

I tried using multiprocessing as well, but it seems that it does not create any threads; all the file names are printed by the main thread only.

    pool = multiprocessing.Pool(processes=10)
    result_list = pool.map(print_file_name, (file for file in os.listdir(dir_name)))

Gaurang Shah

1 Answer


You are creating a thread for each file:

    for i in range(0, len(file_list), 10):
        for index in range(0, 10, 1):
            t = threading.Thread(target=print_file_name, args=(file_list[i+index],))
            threads.append(t)
            t.start()

(Note: you could simply use `for file in file_list` to loop over the file list.)

Using a thread pool, as in the following answer, would be a better solution:

parallel file parsing, multiple CPU cores
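A minimal sketch of the thread-pool approach using `ThreadPoolExecutor` from the standard library's `concurrent.futures` module (the `process_files` helper and the `workers` parameter are just names chosen for this example):

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s [%(threadName)s] [%(levelname)s] %(message)s')


def print_file_name(name):
    logging.info(name)
    return name


def process_files(file_list, workers=10):
    # The pool keeps at most `workers` threads alive and hands each one
    # the next file name as soon as it finishes its previous one --
    # it never creates one thread per file.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map() returns results in the order of the inputs.
        return list(executor.map(print_file_name, file_list))
```

You would then call something like `process_files(os.listdir(dir_name))`, and no batching arithmetic (or the `i+index` indexing that breaks when the file count is not a multiple of 10) is needed.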

Robben_Ford_Fan_boy
  • I tried using multiprocessing as well, but it seems that it's creating only a single thread for all the files. I have updated the question as well. – Gaurang Shah Mar 31 '17 at 07:08