
I would like to create around 50,000 files using Python. They are very simple files, each with fewer than 20 lines.

I first tried adding threading just for the sake of it, and it took 220 seconds on my 8th-gen i7 machine.

WITH THREAD


import time
import threading

# "path" is assumed to be defined elsewhere as the output directory
def random_files(i):
    filepath = path+"/content/%s.html" %(str(i))
    fileobj = open(filepath,"w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30"+"\n"
    l4 = "draft: false"+"\n"
    l5 = 'type: "statecity"'+"\n"
    l6 = "---"+"\n"
    data = l1+l2+l3+l4+l5+l6
    fileobj.writelines(data)
    fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    for i in range(0, 50000):
        i = str(i)
        threading.Thread(name='random_files', target=random_files, args=(i,)).start()
    print("--- %s seconds ---" % (time.time() - start_time))

WITHOUT THREAD

Going the non-threaded route takes 55 seconds.

import time

def random_files():
    for i in range(0, 50000):
        filepath = path+"/content/%s.html" %(str(i))
        fileobj = open(filepath,"w+")
        l1 = "---\n"
        l2 = 'title: "test"\n'
        l3 = "date: 2019-05-01T18:37:07+05:30"+"\n"
        l4 = "draft: false"+"\n"
        l5 = 'type: "statecity"'+"\n"
        l6 = "---"+"\n"
        data = l1+l2+l3+l4+l5+l6
        fileobj.writelines(data)
        fileobj.close()

if __name__ == "__main__":
    start_time = time.time()
    random_files()
    print("--- %s seconds ---" % (time.time() - start_time))

CPU usage for the Python task is 10%, RAM usage is a meager 50 MB, and disk usage averages 4.5 MB/s.

Can the speed be increased drastically?

  • The disk speed, not the amount of CPU parallelism, is the bottleneck. This is a common FAQ. As your experiment very clearly illustrates, pushing more load to the already saturated I/O channel only creates congestion. – tripleee Jun 05 '19 at 17:55
  • Profile your script and see where it's spending most of its time. See [How can you profile a Python script?](https://stackoverflow.com/questions/582336/how-can-you-profile-a-python-script) – martineau Jun 05 '19 at 17:58
  • Are you simply writing the same data to each of the 50,000 files? What is your objective? – Alessi 42 Jun 05 '19 at 18:00
  • Thanks for all your quick replies. Actually this is only a test script. In the original the data will change. – Prabhakar Shanmugam Jun 05 '19 at 18:00
  • @tripleee I see your point. So you mean to say a faster SSD or RAID disk will give better results, is that right? And you also imply that this cannot be sped up much more, right? – Prabhakar Shanmugam Jun 05 '19 at 18:01
  • Your threading solution also creates one thread for each of the 50,000 files instead of splitting the load over the number of threads the computer has. Say your computer has 4 cores and 8 threads; splitting the files evenly over those 8 threads gives 6,250 files each. – Alessi 42 Jun 05 '19 at 18:05
  • @Alessi42 Could you please point me in the right direction? One could say the average disk write speed for an SSD these days is 200 MB/s. Mine lingers around 4.5 MB/s, which raised my suspicion. Can you help point me in the right direction? – Prabhakar Shanmugam Jun 05 '19 at 18:11
  • @tripleee Can the disk write speed be increased from this meager 4.5 MB/s to 200 MB/s? Is there any crazy new idea like creating files in memory or doing a bulk write of all files? Any idea would be greatly appreciated. – Prabhakar Shanmugam Jun 05 '19 at 18:13
  • The raw disk speed, the OS I/O scheduler, and the file system are places you can try to tweak. A ramdisk will be the fastest if you have the memory to spare, but of course it's volatile storage which will have to be committed to more permanent media eventually if you need to keep it. – tripleee Jun 05 '19 at 18:22
  • You should be using `.write()` instead of `.writelines()` to write your string to the file - the latter *works*, but it's very inefficient since it's iterating over the string and writing each character individually. The normal input data for `.writelines()` is a list of strings. – jasonharper Jun 05 '19 at 20:05
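
For illustration, here is a minimal sketch of the distinction jasonharper describes (not part of the original post; the file names are made up for the example): `.write()` takes a single string, while `.writelines()` expects an iterable of strings.

# .write() is the natural call for one pre-built string
data = "---\n" + 'title: "test"\n' + "draft: false\n" + "---\n"
with open("example_write.html", "w") as fileobj:
    fileobj.write(data)

# .writelines() is meant for an iterable of strings, e.g. a list of lines
lines = ["---\n", 'title: "test"\n', "draft: false\n", "---\n"]
with open("example_writelines.html", "w") as fileobj:
    fileobj.writelines(lines)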

1 Answer


Try threading with the load split equally across each of the threads in your system.

This provides an almost linear speed-up with the number of threads the load is split across:

Without Threading:

~11% CPU ~5MB/s Disk

--- 69.15089249610901 seconds ---


With Threading: 4 Threads

22% CPU 13MB/s Disk

--- 29.21335482597351 seconds ---


With Threading: 8 Threads

27% CPU 15MB/s Disk

--- 20.8521249294281 seconds ---


For example:

import time
from threading import Thread

def random_files(i):
    # "path" is assumed to be defined as in the question
    filepath = path+"/content/%s.html" %(str(i))
    fileobj = open(filepath,"w+")
    l1 = "---\n"
    l2 = 'title: "test"\n'
    l3 = "date: 2019-05-01T18:37:07+05:30"+"\n"
    l4 = "draft: false"+"\n"
    l5 = 'type: "statecity"'+"\n"
    l6 = "---"+"\n"
    data = l1+l2+l3+l4+l5+l6
    fileobj.writelines(data)
    fileobj.close()

def pool(start, number):
    # each thread writes its own contiguous batch of files
    for i in range(start, start + number):
        random_files(i)

if __name__ == "__main__":
    start_time = time.time()
    num_files = 50000
    threads = 8
    batch_size = num_files // threads  # integer division; 50,000 divides evenly into 8 batches
    thread_list = [Thread(name='random_files', target=pool, args=(batch_size * thread_index, batch_size)) for thread_index in range(threads)]
    [t.start() for t in thread_list]
    [t.join() for t in thread_list]  # simply required to wait for each of the threads to finish before stopping the timer

    print("--- %s seconds ---" % (time.time() - start_time))

The solution provided here, however, is only an example to show the speed increase that can be achieved. The method of splitting the files into batches only works because 50,000 files divide evenly into 8 batches (one per thread); a more robust pool() function will be required to split an arbitrary load into batches.

Take a look at this SO example of splitting an uneven load across threads.
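
As a rough sketch of one such approach (not part of the original answer), the standard library's concurrent.futures.ThreadPoolExecutor avoids manual batching entirely by handing the 50,000 calls to a fixed pool of worker threads; it assumes random_files is defined as above and reuses the 8-thread figure from the timings.

from concurrent.futures import ThreadPoolExecutor
import time

if __name__ == "__main__":
    start_time = time.time()
    # the executor distributes the 50,000 tasks across 8 worker threads,
    # so the total does not need to divide evenly
    with ThreadPoolExecutor(max_workers=8) as executor:
        executor.map(random_files, range(50000))
    print("--- %s seconds ---" % (time.time() - start_time))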

Hope this helps!

Alessi 42