I have 100 big files, each about 5GB, and I need to split them into smaller files based on their contents. Each big file contains many lines, and each line looks like this:
{"task_op_id": 143677789, "task_op_time": 1530927931, "task_op_tag": 1, "create_time": 1530923701, "status": 2}
I need to split the content based on the task_op_id. Every big file contains 350 different task_op_id values, so each big file should produce 350 different small files, where each small file holds only the lines with the same task_op_id.
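To make the mapping concrete, here is a minimal sketch of the per-line grouping I have in mind (the output filename is simply the task_op_id, as in my code below; the variable names are just for illustration):

import json

line = '{"task_op_id": 143677789, "task_op_time": 1530927931, "task_op_tag": 1, "create_time": 1530923701, "status": 2}'
record = json.loads(line)
# this line should end up appended to a small file named "143677789"
target_filename = str(record['task_op_id'])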
The method I have tried is:
import json
from os import listdir
from os.path import isfile, join
from multiprocessing import Pool


def split_to_id_file(original_file):
    destination_file = 'processed_data2/data_over_one_id/break_into_ids/'
    with open(original_file) as f1:
        for line in f1:
            data_dict = json.loads(line)
            task_op_id = data_dict['task_op_id']
            # append the record to the output file named after its task_op_id
            with open(destination_file + str(task_op_id), 'a+') as f2:
                json.dump(data_dict, f2, ensure_ascii=False)
                f2.write('\n')


# multiprocessing with pool
def multiprocessing_pool(workers_number, job, files_list):
    p = Pool(workers_number)
    p.map(job, files_list)


def main():
    input_path = 'processed_data2/data_over_one_id'
    files_list = [join(input_path, f) for f in listdir(input_path)
                  if isfile(join(input_path, f))
                  and join(input_path, f).split('/')[-1].startswith('uegaudit')]
    multiprocessing_pool(80, split_to_id_file, files_list)


if __name__ == '__main__':
    main()
But the speed is too low: processing 10GB of data takes 2 hours.
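I suspect (though I have not measured it) that reopening the output file for every single line is the bottleneck. A rough sketch of a variant I am considering, which keeps one handle open per task_op_id for the duration of each input file, would look like this (the handles dict is my own naming, and I have not tested it at scale):

import json


def split_to_id_file_cached(original_file):
    destination_file = 'processed_data2/data_over_one_id/break_into_ids/'
    handles = {}  # task_op_id -> already-open output file object
    with open(original_file) as f1:
        for line in f1:
            data_dict = json.loads(line)
            task_op_id = data_dict['task_op_id']
            f2 = handles.get(task_op_id)
            if f2 is None:
                # open each output file only once per input file
                f2 = open(destination_file + str(task_op_id), 'a')
                handles[task_op_id] = f2
            json.dump(data_dict, f2, ensure_ascii=False)
            f2.write('\n')
    # close all per-id output files once the input file is finished
    for f2 in handles.values():
        f2.close()

I am also not sure whether keeping up to 350 handles open in each of the 80 worker processes would run into file descriptor limits.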
So is there a better way to process the data?
Thank you very much for helping.