I came across several threads discussing ijson for loading huge JSON files in Python, since it avoids consuming all available memory.
My file is around 1.4 GB. It has several nodes (see below image), and I am interested in only one node, which holds most of the data (c_driver_location).
My goal is: I want to extract the c_driver_location node data only and insert it into a MySQL table (which will have four columns: id, longitude, latitude, timestamp).
table DDL:
CREATE TABLE drv_locations_backup7May2017 (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    drv_fb_id VARCHAR(50),
    latitude DECIMAL(10, 8) NOT NULL,
    longitude DECIMAL(11, 8) NOT NULL,
    timestamp INT
)
My problem is: I ran the first part of the attached code (up to, but not including, the MySQL connection), but it has been running for 20 hours and still has not finished parsing the JSON. (I tested it on a smaller file and it worked fine.)
Is there an optimal way to make this faster and more efficient?
import pandas as pd
from pandas import DataFrame
import ijson
from sqlalchemy import create_engine  # pandas.to_sql needs a SQLAlchemy engine for MySQL

filename = r"D:\json_file.json"  # raw string so backslashes are not treated as escapes

drv_col_list = ['drv_fb_id', 'latitude', 'longitude', 'timestamp']
drv_df = DataFrame(columns=drv_col_list)
drv_df.timestamp = drv_df.timestamp.astype(int)
counter = 0

with open(filename, 'r') as fd:
    parser = ijson.parse(fd)
    for prefix, event, value in parser:
        if prefix == 'c_driver_location' and event == 'map_key':
            # each map key under c_driver_location is a driver id
            drv_fb_id = value
            counter = counter + 1
        elif prefix.endswith('.latitude'):
            latitude = value
        elif prefix.endswith('.longitude'):
            longitude = value
        elif prefix.endswith('.timestamp'):
            timestamp = value
        elif prefix.endswith(drv_fb_id) and event == 'end_map':
            # end of one driver's object: append the collected row
            drv_df = drv_df.append(
                pd.DataFrame({'drv_fb_id': drv_fb_id,
                              'latitude': latitude,
                              'longitude': longitude,
                              'timestamp': timestamp}, index=[0]),
                ignore_index=True)

# write to MySQL (to_sql requires a SQLAlchemy engine, not a raw pymysql connection)
engine = create_engine('mysql+pymysql://user:abcdefg@53.000.00.00:3306/newdb')
drv_df.to_sql(con=engine, name='drv_locations_backup7May2017',
              if_exists='replace', index=False)
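For reference, the usual fix for this kind of slowdown is to stop appending to a DataFrame one row at a time (each append copies the whole frame) and instead batch rows straight into the database with executemany. Below is a minimal, hedged sketch of that idea; it is not the original code. sqlite3 stands in for MySQL only so the example is self-contained and runnable; with pymysql the placeholders would be %s instead of ?, and the rows iterable would be fed from the ijson loop above.

```python
import sqlite3

def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows in fixed-size batches with executemany:
    one round-trip per batch, constant memory, no DataFrame."""
    sql = ("INSERT INTO drv_locations_backup7May2017 "
           "(drv_fb_id, latitude, longitude, timestamp) VALUES (?, ?, ?, ?)")
    cur = conn.cursor()
    batch = []
    total = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cur.executemany(sql, batch)
            conn.commit()
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        cur.executemany(sql, batch)
        conn.commit()
        total += len(batch)
    return total

# demo: an in-memory sqlite database standing in for MySQL
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE drv_locations_backup7May2017 "
             "(id INTEGER PRIMARY KEY AUTOINCREMENT, drv_fb_id TEXT, "
             "latitude REAL, longitude REAL, timestamp INTEGER)")
rows = [('drv_%d' % i, 24.7 + i * 1e-6, 46.6 + i * 1e-6, 1494000000 + i)
        for i in range(2500)]
inserted = insert_in_batches(conn, iter(rows), batch_size=1000)
print(inserted)  # 2500
```

With this shape, the (drv_fb_id, latitude, longitude, timestamp) tuples collected at each `end_map` event would be yielded from a generator instead of appended to a DataFrame, so memory stays flat regardless of file size.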