I have a large MongoDB export with one JSON document per line. I need to convert it to a columnar format (a Pandas DataFrame or CSV) so I can analyze it and draw some insights. I don't have much experience parsing MongoDB exports, and I'm not sure whether I'm doing it right.
What I have done so far: to be able to use json.loads, I appended a comma to the end of each line and wrapped the whole file in [].
A sample (part of one JSON document) looks like this:
{ "id" : "12345id",
  "price" : 202.4,
  "order_id" : "bc2341",
  "order_time" : { "date" : "2018-03-27T12:22:50.935+0000" },
  "work_data" : {
    "worker_id" : "5938686",
    "job" : [
      { "id" : "663442",
        "approx_time" : "2018-03-27T12:13:58.33Z",
        "job_type" : null },
      { "id" : "663442",
        "approx_time" : "2018-03-27T12:13:58.33Z",
        "job_type" : null }
    ]
  }
}
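As a side note, the comma-and-brackets step may not be needed at all: an export with one JSON document per line is newline-delimited JSON, so each line can be passed to json.loads on its own. A minimal sketch (the sample data here is simulated in a string; with a real file you would iterate over the file object the same way):

```python
import io
import json

# Simulated export contents: newline-delimited JSON, one document per line.
ndjson = '{"id": "a", "price": 1.0}\n{"id": "b", "price": 2.0}\n'

records = []
for line in io.StringIO(ndjson):  # iterate exactly as over an open file
    line = line.strip()
    if line:  # skip blank lines
        records.append(json.loads(line))
```

This avoids rewriting the file and keeps memory use down, since each document is parsed independently.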
Here is the code I wrote to produce the CSV:
import csv

# config_file is the list of parsed documents (the json.loads output)
with open('..result.csv', 'w', newline='') as report_file:
    writer = csv.writer(report_file)
    for row in config_file:
        jobs = row.get('work_data', {}).get('job', [])
        jobs_str = ''
        if isinstance(jobs, list):
            for job in jobs:
                jobs_str += job.get('approx_time', '')
        writer.writerow((row['order_time']['date'], jobs_str))
It works, but I can't figure out how to split multiple jobs into separate columns so each approx_time gets its own column. Can someone help with a solution? Maybe there are better ways to convert JSON documents from MongoDB to CSV/Pandas?
In the end I need something like this:
one row per JSON document; if a document has multiple jobs, they must go into separate columns
+---------+--------+---------------+--------+---------------+
| id |id_job_1| approx_time_1 |id_job_2| approx_time_2 |
+---------+--------+---------------+--------+---------------+
| 12345id | 664442 | 27.03.2018 | 663442 | 27.03.2018 |
| 22345id | 633442 | 27.03.2018 | 622442 | 27.03.2018 |
+---------+--------+---------------+--------+---------------+
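One way to get this wide layout is to flatten each document into a dict with numbered job columns and build a DataFrame from those dicts. This is only a sketch under assumptions about the document shape (the `docs` list below is hypothetical sample data mirroring the structure above, and column names like `id_job_1` follow the desired table):

```python
import pandas as pd

# Hypothetical parsed documents, shaped like the sample above.
docs = [
    {"id": "12345id",
     "order_time": {"date": "2018-03-27T12:22:50.935+0000"},
     "work_data": {"job": [
         {"id": "663442", "approx_time": "2018-03-27T12:13:58.33Z"},
         {"id": "663443", "approx_time": "2018-03-27T12:14:58.33Z"},
     ]}},
]

rows = []
for doc in docs:
    row = {"id": doc.get("id"),
           "order_date": doc.get("order_time", {}).get("date")}
    jobs = doc.get("work_data", {}).get("job") or []
    # Spread each job into numbered columns: id_job_1, approx_time_1, ...
    for i, job in enumerate(jobs, start=1):
        row[f"id_job_{i}"] = job.get("id")
        row[f"approx_time_{i}"] = job.get("approx_time")
    rows.append(row)

df = pd.DataFrame(rows)          # one row per document, jobs in columns
df.to_csv("result.csv", index=False)
```

Documents with fewer jobs than the maximum simply get NaN in the extra columns, which pandas handles naturally; `df.to_csv` then writes the same wide table to disk.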