Trying to write a python script that will allow me to read a .csv file and mix up the values to a specific format/data structure in .json that I can then import into mongoDB. I'm using pedestrian data as my dataset and there are over a million entries with redundant data. I'm stuck on writing the actual script and translating that into my desired .json format.
data.csv - in table format for easier reading and raw
Id,Date_Time,Year,Month,Mdate,Day,Time,Sensor_ID,Sensor_Name,Hourly_Counts
1, 01-JUN-2009 00:00,2009,June,1,Monday,0,4,Town Hall (West),194
2, 01-JUN-2009 00:00,2009,June,1,Monday,0,17,Collins Place (South),21
3, 01-JUN-2009 00:00,2009,June,1,Monday,0,18,Collins Place (North),9
4, 01-JUN-2009 00:00,2009,June,1,Monday,0,16,Australia on Collins,39
5, 01-JUN-2009 00:00,2009,June,1,Monday,0,2,Bourke Street Mall (South),28
6, 01-JUN-2009 00:00,2009,June,1,Monday,0,1,Bourke Street Mall (North),37
7, 01-JUN-2009 00:00,2009,June,1,Monday,0,13,Flagstaff Station,1
8, 01-JUN-2009 00:00,2009,June,1,Monday,0,3,Melbourne Central,155
9, 01-JUN-2009 00:00,2009,June,1,Monday,0,15,State Library,98
10, 01-JUN-2009 00:00,2009,June,1,Monday,0,9,Southern Cross Station,7
11, 01-JUN-2009 00:00,2009,June,1,Monday,0,10,Victoria Point,8
12, 01-JUN-2009 00:00,2009,June,1,Monday,0,12,New Quay,30
Because I'll be uploading to mongoDB, the Id
in my context is redundant to me so I need my script to skip that. Sensor_ID
is not unique but I'm planning to make it the PK and create a list of objects differentiating the Hourly_Count
.
I'm aiming to generate a JSON object like this from the data:
**data.json**
{
{
"Sensor_ID": 4,
"Sensor_Name": "Town Hall(West)",
"countList":
[
{
"Date_Time": "01-JUN-2009 00:00",
"Year":2009,
"Month": "June",
"Mdate": 1,
"Day": "Monday",
"Time": 0,
"Hourly_Counts": 194
},
{
"Date_Time": "01-JUN-2009 00:00",
"Year":2009,
"Month": "June",
"Mdate": 1,
"Day": "Monday",
"Time": 1,
"Hourly_Counts": 82
}
]
},
{
"Sensor_ID": 17,
"Sensor_Name": "Collins Place(North)",
"countList":
[
{
"Date_Time": "01-JUN-2009 00:00",
"Year":2009,
"Month": "June",
"Mdate": 1,
"Day": "Monday",
"Time": 0,
"Hourly_Counts": 21
}
]
}
}
So on so forth. I'm trying to make it so when it reads a Sensor_ID
it creates an json object from the fields listed and adds it to the countList
. Added in another from station_ID = 4
to the countList
.
I am using python 2.7.x and I have looked at every question concerning this on stackoverflow and every other website. Very few rarely seem to want to restructure the .csv data when converting to .json so it's been a bit difficult.
What I have so far, still relatively new to python so thought this would be good to try out.
csvtojson.py
import csv
import json
def csvtojson():
filename = 'data.csv'
fieldnames = ('Id','Date_Time','Year','Month','Mdate','Day',
'Time','Sensor_ID','Sensor_Name', 'Hourly_Counts')
dataTime = ('Date_Time','Year','Month','Mdate','Day',
'Time', 'Hourly_Counts')
all_data = {}
with open(filename, 'rb') as csvfile:
reader = csv.DictReader(csvfile, fieldnames)
#skip header
next(reader)
current_sensorID = None
for row in reader:
sensor_ID = row['Sensor_ID']
sensorName = row['Sensor_Name']
data = all_data[sensor_ID] = {}
data['dataTime'] = dict((k, row[k]) for k in dataTime)
print json.dumps(all_data, indent=4, sort_keys=True)
if __name__ == "__main__":
csvtojson()
As far as I got is that countList
is in is own object but it's not creating a list of objects and may mess up the import to mongoDB. It is filtering through Sensor_ID
but overwriting if there are duplicates instead of adding to countList
. And I can't seem to get it in the format/data structure I want - I'm not even sure if that's the right structure, ultimate goal is to import the millions of tuples into mongoDB like the way I listed. I'm trying a small set now to test it out.