
I have 100 big files, each about 5GB, and I need to split them into smaller files based on their contents. The big files have many lines, each like this:

{"task_op_id": 143677789, "task_op_time": 1530927931, "task_op_tag": 1, "create_time": 1530923701, "status": 2}

and I need to split the content based on task_op_id. Every big file has 350 different task_op_id values, so each one should generate 350 smaller files, each containing the lines for one task_op_id.

The method I tried is:

import json
from multiprocessing import Pool
from os import listdir
from os.path import isfile, join


def split_to_id_file(original_file):
    destination_file = 'processed_data2/data_over_one_id/break_into_ids/'
    with open(original_file) as f1:
        for line in f1:
            data_dict = json.loads(line)
            task_op_id = data_dict['task_op_id']
            # the destination file is opened and closed for every line
            with open(destination_file + str(task_op_id), 'a+') as f2:
                json.dump(data_dict, f2, ensure_ascii=False)
                f2.write('\n')

# multiprocessing with pool
def multiprocessing_pool(workers_number, job, files_list):
    p = Pool(workers_number)
    p.map(job, files_list)


def main():
    input_path = 'processed_data2/data_over_one_id'
    files_list = [join(input_path, f) for f in listdir(input_path)
                  if isfile(join(input_path, f))
                  and join(input_path, f).split('/')[-1].startswith('uegaudit')]
    multiprocessing_pool(80, split_to_id_file, files_list)


if __name__ == '__main__':
    main()

But it is too slow: processing 10GB of data takes 2 hours.

So is there a better way to process the data?

Thank you very much for helping.

  • At least not reformatting the line from JSON would speed things up, probably. Just write `line` into the new file. – Sami Kuhmonen Dec 18 '18 at 12:48
  • You could try using the Pandas library to read chunks of the data and then organize it. Btw, if your code already works and you want to improve it, the right place to go is **Code Review** (https://codereview.stackexchange.com). Stack Overflow's focus is to fix non-working code :) – Pedro Martins de Souza Dec 18 '18 at 12:50
  • You open and close the destination file for each line. Writing in batches will speed things up. – Aaron_ab Dec 18 '18 at 12:52
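
A minimal sketch of what the two performance comments suggest, combined (hypothetical function name; it assumes a few hundred distinct task_op_id values per input file, as stated in the question, so keeping one open handle per ID stays well under the usual open-file limit):

import json
import os

def split_to_id_file_fast(original_file,
                          dest_dir='processed_data2/data_over_one_id/break_into_ids/'):
    # One open handle per task_op_id, reused for the whole input file.
    handles = {}
    try:
        with open(original_file) as src:
            for line in src:
                task_op_id = json.loads(line)['task_op_id']
                out = handles.get(task_op_id)
                if out is None:
                    out = open(os.path.join(dest_dir, str(task_op_id)), 'a')
                    handles[task_op_id] = out
                out.write(line)  # write the raw line, no re-serialization
    finally:
        for out in handles.values():
            out.close()

If parsing is still the bottleneck after that, the ID could be pulled out with a plain string split instead of json.loads, at the cost of assuming the field order shown in the example line.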

2 Answers


I suspect that the major time sink is the file I/O operations. Can you break down the running time and check that?

Another cause could be the JSON parser. Check out this thread for more information.
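
For example, a rough way to separate the read cost from the parse cost on one input file (a minimal sketch with a hypothetical helper; the second pass benefits from the OS page cache, so treat the numbers as approximate):

import json
import time

def profile_one_file(path):
    # Pass 1: read only (pure I/O cost).
    t0 = time.perf_counter()
    with open(path) as f:
        for _ in f:
            pass
    t_read = time.perf_counter() - t0

    # Pass 2: read + parse; the difference is roughly the JSON-parsing cost.
    t0 = time.perf_counter()
    with open(path) as f:
        for line in f:
            json.loads(line)
    t_total = time.perf_counter() - t0

    print('read only: %.1fs, read+parse: %.1fs, parse ~ %.1fs'
          % (t_read, t_total, t_total - t_read))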


Can you sort these files? If so, don't parse every line as JSON; only parse the lines where the ID changes.

Something like this?

import json


def get_id(json_line):
    data_dict = json.loads(json_line)
    return data_dict['task_op_id']


def split_to_id_file(original_file):
    destination_file = 'processed_data2/data_over_one_id/break_into_ids/'
    current_id = None
    f2 = None
    with open(original_file) as f1:
        for line in f1:
            # Only parse the JSON when the current ID no longer appears in the line.
            if current_id is None or current_id not in line:
                if f2 is not None:
                    f2.close()
                task_op_id = get_id(line)
                # include the comma so an ID that is a prefix of a longer ID does not match
                current_id = '"task_op_id": ' + str(task_op_id) + ','
                f2 = open(destination_file + str(task_op_id), 'a')
            f2.write(line)  # the line already ends with '\n'
    if f2 is not None:
        f2.close()
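
Note that this only helps if all lines with the same task_op_id are contiguous in the input, which is exactly what sorting the file gives you. A small one-off check of that assumption (hypothetical helper; it parses every line, so run it once, not in the hot path):

import json

def ids_are_grouped(path):
    seen = set()
    current = None
    with open(path) as f:
        for line in f:
            op_id = json.loads(line)['task_op_id']
            if op_id != current:
                if op_id in seen:
                    return False  # an ID reappeared after a gap: not grouped
                seen.add(op_id)
                current = op_id
    return True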