2

I came across several thread talking about ijson to load huge JSON files in python, as this is the way to not consume all memory.

My file is around 1.4 GB in size, it has several nodes (see below image), i am interested only in one node which hold most of the data (c_driver_location).

JSON_1.4GB

my goal is : i want to extract c_driver_location node data only and insert it into mysql db table (which will have four columns: id,longitude,latitude,timestamp).

table ddl:

create table drv_locations_backup7May2017 (id bigint unsigned auto_increment primary key, drv_fb_id varchar(50), latitude DECIMAL(10, 8) NOT NULL, longitude DECIMAL(11, 8) NOT NULL, timestamp int )

my problem is : i ran the first part of the attached code (till before connecting to mysql), but it is running since 20 hours and still did not finish parsing the json. (i tested on smaller file and it worked fine).

Is there an optimal way to make this faster and more efficient ?

import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import ijson
import pymysql.cursors
import pymysql


filename = "D:\json_file.json"
drv_col_list = ['drv_fb_id','latitude','longitude','timestamp']
drv_df = DataFrame(columns = drv_col_list)
drv_df.timestamp = drv_df.timestamp.astype(int)

counter = 0
with open(filename, 'r') as fd:
    parser = ijson.parse(fd)
    for prefix, event, value in parser:
        if prefix == 'c_driver_location' and str(event) == 'map_key':
            drv_fb_id = value
            counter = counter + 1
        elif prefix.endswith('.latitude'):
            latitude = value
        elif prefix.endswith('.longitude'):
            longitude = value
        elif prefix.endswith('.timestamp'):
            timestamp = value
        elif prefix.endswith(drv_fb_id) and str(event) == 'end_map':
            drv_df = drv_df.append(pd.DataFrame({'drv_fb_id':drv_fb_id,'latitude':latitude,'longitude':longitude,'timestamp':timestamp},index=[0]),ignore_index=True)
connection = pymysql.connect(host='53.000.00.00',
                             port = 3306,
                             user='user',
                             password='abcdefg',
                             db ='newdb',
                             # charset='utf8mb4',
                             cursorclass=pymysql.cursors.DictCursor)
# write to mysql 
drv_df.to_sql(con=connection, name='drv_locations_backup7May2017', if_exists='replace', flavor='mysql')                                               
connection.close()
Anthony Kong
  • 37,791
  • 46
  • 172
  • 304
Ammar Khwaira
  • 35
  • 1
  • 8

1 Answers1

1

You just need to modify your code slightly to produce a data dump.

import ijson


outfile = "D:\upload_data.txt"
filename = "D:\json_file.json"
drv_col_list = ['drv_fb_id','latitude','longitude','timestamp']
timestamp = drv_df.timestamp.astype(int)


ofile = open(outfile, "rw")

counter = drv_fb_id = latitude = longitude = 0
with open(filename, 'r') as fd:
    parser = ijson.parse(fd)
    for prefix, event, value in parser:
        if prefix == 'c_driver_location' and str(event) == 'map_key':
            drv_fb_id = value
            counter = counter + 1
        elif prefix.endswith('.latitude'):
            latitude = value
        elif prefix.endswith('.longitude'):
            longitude = value
        elif prefix.endswith('.timestamp'):
            timestamp = value
        elif prefix.endswith(drv_fb_id) and str(event) == 'end_map':
            print >>ofile, ",".join(map(str, [drv_fb_id, latitude, longitude, timestamp]))           

close(ofile)

Now you have a comma delimited output in D:\upload_data.txt

The code is untested.

I do not have a test mysql database at the moment. I trust that the mysql manual is easy to follow . You table structure is not really complicated.

Anthony Kong
  • 37,791
  • 46
  • 172
  • 304
  • Thank you. I will test it and feedback accordingly. – Ammar Khwaira May 11 '17 at 11:39
  • i ran the code, the first error is " latitude is not defined". so i initialized it to latitude = '0' . also for logitude and timestamp. when i ran the code again i got this error : "TypeError: sequence item 1: expected string, Decimal found" . this is on the statement of print>>ofile. – Ammar Khwaira May 11 '17 at 11:59
  • `",".join ` expects a list of string. I updated the answer – Anthony Kong May 11 '17 at 12:03