I have about 600,000 text files in the "./data" directory. Each of them contains a single line.
I want to merge them into one file, where each line is enclosed in single quotes (').
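For example, if a (hypothetical) data/a.txt contains foo and data/b.txt contains bar, the merged file should contain:
'foo'
'bar'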
I wrote a Python script like the following:
#!/usr/bin/env python3
from glob import glob

def main():
    # Collect the paths of all files under data/.
    files = glob("data/*")
    for f in files:
        # Open each file, wrap its single line in quotes, and print it.
        with open(f) as f2:
            print("'" + f2.read() + "'")

if __name__ == "__main__":
    main()
Saving this as merge.py, I can get the merged file with the command
./merge.py > merged.txt
At first, as an efficiency test, I ran the code with for f in files replaced by for f in files[:10000]. It finished in a few seconds, so I expected a run over all the files (i.e. with the original for f in files line) to finish within several minutes. I changed the line back and ran it again, but it had not finished even after 15 minutes. Puzzled, I opened another terminal and ran
while true; do date; wc -l merged.txt; sleep 300; done
According to the output of this command, my script was processing only about 20k files per 5 minutes (far less than I expected; at that rate, all 600,000 files would take roughly 2.5 hours), and it got slower as the run went on.
My script just repeatedly opens a file, writes its line to standard output, and closes the file. To my understanding, an iteration should take the same time near the beginning of the loop as after hundreds of thousands of files have been processed.
Is there any reason that makes the process get slower?
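In case it helps to diagnose this, here is a minimal sketch of how the loop could be instrumented to confirm the slowdown (a hypothetical variant of merge.py, not the script that produced merged.txt; it reports to stderr how long each batch of 10,000 files takes):

#!/usr/bin/env python3
# Hypothetical instrumented variant: does the same work as merge.py,
# but also reports per-batch timings to stderr.
import sys
import time
from glob import glob

def main():
    files = glob("data/*")
    start = time.monotonic()
    for i, f in enumerate(files, 1):
        with open(f) as f2:
            print("'" + f2.read() + "'")
        if i % 10000 == 0:
            # Time spent on the last 10,000 files; if each iteration were
            # constant-cost, these numbers should stay roughly equal.
            elapsed = time.monotonic() - start
            print(f"{i} files done, last batch took {elapsed:.1f}s", file=sys.stderr)
            start = time.monotonic()

if __name__ == "__main__":
    main()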