
I have a large MongoDB export with one JSON document per line. I need to convert it to a columnar format (pandas DataFrame or CSV) so I can analyze it and draw some insights. I don't have much experience parsing MongoDB output, and I am not sure whether I am doing it right.

What I have done so far: to be able to use json.loads, I appended a comma to the end of each line and wrapped the whole file in [].
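Since the export already has one JSON document per line, a simpler alternative (a sketch; `dump.json` is just a placeholder for your export file) is to parse each line separately instead of rewriting the file into one big array:

```python
import json

def parse_json_lines(lines):
    """Parse an iterable of JSON-lines strings into a list of dicts,
    skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

# Usage with the export file:
# with open('dump.json') as f:
#     records = parse_json_lines(f)
```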

A sample (part of one JSON document) looks like this:

    {
        "id": "12345id",
        "price": 202.4,
        "order_id": "bc2341",
        "order_time": { "date": "2018-03-27T12:22:50.935+0000" },
        "work_data": {
            "worker_id": "5938686",
            "job": [
                { "id": "663442", "approx_time": "2018-03-27T12:13:58.33Z", "job_type": null },
                { "id": "663442", "approx_time": "2018-03-27T12:13:58.33Z", "job_type": null }
            ]
        }
    }

I wrote this code to produce the CSV:

    import csv

    with open('..result.csv', 'w', newline='') as report_file:
        writer = csv.writer(report_file)
        for row in config_file:  # config_file: the list of parsed JSON documents
            jobs = row.get('work_data', {}).get('job', [])
            jobs_str = ''
            if isinstance(jobs, list):
                for job in jobs:
                    jobs_str += job.get('approx_time', '')
            writer.writerow((row['order_time']['date'], jobs_str))

It works, but I can't figure out how to split multiple jobs into separate columns so each approx_time gets its own column. Can someone help with a solution? Maybe there are better ways to convert JSON from MongoDB to CSV/pandas?

In the end I need something like this:

One row per JSON document; if a document contains multiple jobs, they must go into separate columns:

    +---------+--------+---------------+--------+---------------+
    | id      |id_job_1| approx_time_1 |id_job_2| approx_time_2 |
    +---------+--------+---------------+--------+---------------+
    | 12345id | 664442 | 27.03.2018    | 663442 | 27.03.2018    |
    | 22345id | 633442 | 27.03.2018    | 622442 | 27.03.2018    |
    +---------+--------+---------------+--------+---------------+

  • Not sure exactly what you mean here: "how to split multiples jobs into different columns to get different approx_time". What are you trying to achieve? Do you need to average 'approx_time'? Just a list of all 'approx_time'? – Giannis Apr 13 '18 at 16:06

1 Answer


Maybe start by getting your input closer to the required format. Have you investigated MongoDB's export tools? mongoexport can write CSV directly; see the answers on SO and the MongoDB documentation.

Also, when handling files, it's best to use context managers, so the file is closed even if an exception is raised.
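For example (a minimal sketch; the filename and header are placeholders), a `with` block closes the file automatically, so there is no `close()` call to forget:

```python
import csv

# The with-statement guarantees the file is closed when the block exits,
# even if writerow raises an exception partway through.
with open('result.csv', 'w', newline='') as report_file:
    writer = csv.writer(report_file)
    writer.writerow(['id', 'order_time'])
```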

If you need one CSV row per job, move your writer.writerow call inside the job loop.

Update

In the CSV you are creating, put all the standard information in the first columns, and append each extra job's columns at the end of the row. When you load the CSV into pandas, rows will then get NaN in the Job N columns that do not exist for that row. For example:

id_1, job_1_id, job_2_id
id_2, job_1_id, NaN

In the above example, row 1 has 2 jobs and row 2 has 1 job. You can see how pandas handles such ragged rows in this post.

Giannis
  • I need one CSV row per global ID (the sample is one row). It must have all columns like id, price, order_id, etc., and 2 job ids, 2 job approx_times and 2 job_types (or more; in some rows there are more than 2 jobs) – Aleksey Vinokurov Apr 13 '18 at 16:38
  • @AlekseyVinokurov you may have solved this by now, but I added a possible solution to your problem – Giannis Apr 16 '18 at 14:05