I have millions of files being created each hour. Each file has one line of data. These files need to be merged into a single file.
I have tried doing this in the following way:
- Use aws s3 cp to download the hour's files into a local directory.
- Merge the files with a single bash command, OR
- Merge the files with a Python script (the rough shape of these commands is sketched below).
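For context, the download-and-merge step looks roughly like this; the bucket name, prefix and the merge.py file name are placeholders, not the real values:

# 1. download the hour's files into a local directory
aws s3 cp s3://my-bucket/some/prefix/for/the/hour/ ./temp/ --recursive

# 2a. merge with bash ...
cat ./temp/* > merged.txt

# 2b. ... or merge with the Python script shown further down
python merge.py merged.txt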
This hourly job runs in Airflow on Kubernetes (EKS). It takes more than an hour to complete, so a backlog keeps building up. Another problem is that it often makes the EC2 node unresponsive because of high CPU and memory usage. What is the most efficient way to run this job?
The Python script, for reference:
from os import listdir
import sys

# from tqdm import tqdm

files = listdir('./temp/')          # the one-line files downloaded for this hour
dest = sys.argv[1]                  # path of the merged output file

data = []
tot_len = len(files)
percent = max(tot_len // 100, 1)    # avoid modulo-by-zero when there are fewer than 100 files

# Read every file into memory, printing progress roughly every 1% of files.
for i, file in enumerate(files):
    if i % percent == 0:
        print(f'{i // percent}% complete.')
    with open('./temp/' + file, 'r') as f:
        data.append(f.read())

# Join everything held in memory and write it out in one go.
result = '\n'.join(data)
with open(dest, 'w') as f:
    f.write(result)
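The script is invoked after the download step finishes, with the output path as its only argument, e.g. python merge.py merged.txt, run from the directory that contains ./temp/ (merge.py and merged.txt are placeholder names).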