I have about 600,000 text files in the "./data" directory. Each of them contains a single line.
I want to merge them into one file, where each line is enclosed in single quotes (').
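For example, if a (hypothetical) data/a.txt contains foo and data/b.txt contains bar, the merged file should contain:
'foo'
'bar'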
I wrote a Python script like the following:
#!/usr/bin/env python3
from glob import glob

def main():
    # Collect the paths of all files under data/.
    files = glob("data/*")
    for f in files:
        # Open each file, wrap its single line in quotes, and print it.
        with open(f) as f2:
            print("'" + f2.read() + "'")

if __name__ == "__main__":
    main()
Saving this as merge.py, I can get the merged file with the command
./merge.py > merged.txt
At first, as an efficiency test, I ran the code with for f in files replaced by for f in files[:10000]. It finished in a few seconds, so I expected a run over all the files (i.e. with the original for f in files line) to finish within several minutes. I changed the line back and ran it again, but it had not finished even after 15 minutes. Puzzled, I opened another terminal and ran
while true; do date; wc -l merged.txt; sleep 300; done
According to the output of this command, my script was processing only about 20k files per 5 minutes (far less than I expected; at that rate, all 600,000 files would take roughly 2.5 hours), and it got slower as the run went on.
My script just repeatedly opens a file, writes its line to standard output, and closes the file. To my understanding, an iteration should take the same time near the beginning of the loop as after hundreds of thousands of files have been processed.
Is there any reason that makes the process get slower?
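In case it helps to diagnose this, here is a minimal sketch of how the loop could be instrumented to confirm the slowdown (a hypothetical variant of merge.py, not the script that produced merged.txt; it reports to stderr how long each batch of 10,000 files takes):

#!/usr/bin/env python3
# Hypothetical instrumented variant: does the same work as merge.py,
# but also reports per-batch timings to stderr.
import sys
import time
from glob import glob

def main():
    files = glob("data/*")
    start = time.monotonic()
    for i, f in enumerate(files, 1):
        with open(f) as f2:
            print("'" + f2.read() + "'")
        if i % 10000 == 0:
            # Time spent on the last 10,000 files; if each iteration were
            # constant-cost, these numbers should stay roughly equal.
            elapsed = time.monotonic() - start
            print(f"{i} files done, last batch took {elapsed:.1f}s", file=sys.stderr)
            start = time.monotonic()

if __name__ == "__main__":
    main()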