I'm trying to write a Python script that reads a .csv file and restructures the values into a specific .json format/data structure that I can then import into MongoDB. I'm using pedestrian data as my dataset, and there are over a million entries with redundant data. I'm stuck on writing the actual script and translating the data into my desired .json format.

**data.csv** (raw; first rows shown):

Id,Date_Time,Year,Month,Mdate,Day,Time,Sensor_ID,Sensor_Name,Hourly_Counts
1, 01-JUN-2009 00:00,2009,June,1,Monday,0,4,Town Hall (West),194
2, 01-JUN-2009 00:00,2009,June,1,Monday,0,17,Collins Place (South),21
3, 01-JUN-2009 00:00,2009,June,1,Monday,0,18,Collins Place (North),9
4, 01-JUN-2009 00:00,2009,June,1,Monday,0,16,Australia on Collins,39
5, 01-JUN-2009 00:00,2009,June,1,Monday,0,2,Bourke Street Mall (South),28
6, 01-JUN-2009 00:00,2009,June,1,Monday,0,1,Bourke Street Mall (North),37
7, 01-JUN-2009 00:00,2009,June,1,Monday,0,13,Flagstaff Station,1
8, 01-JUN-2009 00:00,2009,June,1,Monday,0,3,Melbourne Central,155
9, 01-JUN-2009 00:00,2009,June,1,Monday,0,15,State Library,98
10, 01-JUN-2009 00:00,2009,June,1,Monday,0,9,Southern Cross Station,7
11, 01-JUN-2009 00:00,2009,June,1,Monday,0,10,Victoria Point,8
12, 01-JUN-2009 00:00,2009,June,1,Monday,0,12,New Quay,30

Because I'll be uploading to MongoDB, the Id field is redundant in my context, so I need my script to skip it. Sensor_ID is not unique, but I'm planning to make it the primary key and collect each sensor's hourly counts into a list of objects.

I'm aiming to generate a JSON array of objects like this from the data:

**data.json**

    [
        {
            "Sensor_ID": 4,
            "Sensor_Name": "Town Hall (West)",
            "countList":
                [
                    {
                        "Date_Time": "01-JUN-2009 00:00",
                        "Year": 2009,
                        "Month": "June",
                        "Mdate": 1,
                        "Day": "Monday",
                        "Time": 0,
                        "Hourly_Counts": 194
                    },
                    {
                        "Date_Time": "01-JUN-2009 01:00",
                        "Year": 2009,
                        "Month": "June",
                        "Mdate": 1,
                        "Day": "Monday",
                        "Time": 1,
                        "Hourly_Counts": 82
                    }
                ]
        },
        {
            "Sensor_ID": 17,
            "Sensor_Name": "Collins Place (South)",
            "countList":
                [
                    {
                        "Date_Time": "01-JUN-2009 00:00",
                        "Year": 2009,
                        "Month": "June",
                        "Mdate": 1,
                        "Day": "Monday",
                        "Time": 0,
                        "Hourly_Counts": 21
                    }
                ]
        }
    ]

So on and so forth. I'm trying to make it so that when the script reads a Sensor_ID, it creates a JSON object from the fields listed and adds it to that sensor's countList. (I added a second, hand-made entry for Sensor_ID 4 to the countList above to show what I mean.)

I am using Python 2.7.x, and I have looked at every question concerning this on Stack Overflow and every other website. Very few seem to want to restructure the .csv data when converting to .json, so it's been a bit difficult.

What I have so far is below; I'm still relatively new to Python, so I thought this would be a good exercise.

**csvtojson.py**

import csv
import json

def csvtojson():

    filename = 'data.csv'
    fieldnames = ('Id','Date_Time','Year','Month','Mdate','Day',
    'Time','Sensor_ID','Sensor_Name', 'Hourly_Counts')

    dataTime = ('Date_Time','Year','Month','Mdate','Day',
    'Time', 'Hourly_Counts')

    all_data = {}

    with open(filename, 'rb') as csvfile:
        reader = csv.DictReader(csvfile, fieldnames)
        #skip header
        next(reader)
        for row in reader:
            sensor_ID = row['Sensor_ID']
            sensorName = row['Sensor_Name']
            # a fresh dict is bound for this sensor_ID on every row, so
            # duplicate IDs overwrite the previous entry instead of appending
            data = all_data[sensor_ID] = {}
            data['dataTime'] = dict((k, row[k]) for k in dataTime)


        print json.dumps(all_data, indent=4, sort_keys=True)    

if __name__ == "__main__":

    csvtojson()

As far as I've got, each sensor ends up in its own object, but the script isn't creating a list of objects, which may mess up the import into MongoDB. It is filtering by Sensor_ID, but it overwrites on duplicates instead of adding to countList. And I can't seem to get the format/data structure I want - I'm not even sure that's the right structure. The ultimate goal is to import the millions of rows into MongoDB the way I listed above; I'm trying a small set now to test it out.
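
Edit: here is a minimal sketch of the fix I'm experimenting with (names like `sensors` and the `data.json` output file are just placeholders): keep a dict keyed by Sensor_ID, create the sensor's entry the first time its ID appears, and append every row's time fields to its countList. Numeric fields stay strings here; they'd need `int()` casts to match the target exactly.

    import csv
    import json

    def csvtojson():
        filename = 'data.csv'
        dataTime = ('Date_Time', 'Year', 'Month', 'Mdate', 'Day',
                    'Time', 'Hourly_Counts')

        sensors = {}  # Sensor_ID -> the document for that sensor
        with open(filename, 'rb') as csvfile:
            # no explicit fieldnames, so DictReader reads them from the header
            reader = csv.DictReader(csvfile)
            for row in reader:
                sensor_ID = row['Sensor_ID']
                if sensor_ID not in sensors:
                    sensors[sensor_ID] = {
                        'Sensor_ID': sensor_ID,
                        'Sensor_Name': row['Sensor_Name'],
                        'countList': [],
                    }
                # append instead of overwrite
                sensors[sensor_ID]['countList'].append(
                    dict((k, row[k]) for k in dataTime))

        # one JSON array of documents
        with open('data.json', 'w') as jsonfile:
            json.dump(list(sensors.values()), jsonfile, indent=4)

    if __name__ == '__main__':
        csvtojson()

With the array written to data.json, something like `mongoimport --db mydb --collection pedestrians --jsonArray --file data.json` (database and collection names made up) should load it. And since MongoDB's actual primary key is the `_id` field, emitting the sensor's ID under `_id` instead of `Sensor_ID` would make it the primary key proper.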

  • You could use pandas' `read_csv` to load your data into a dataframe, do the required transformation, and then use `df.to_json` to output your required JSON structure. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html – salomonvh Apr 13 '17 at 12:00

2 Answers


Please check the following.

https://github.com/gurdyals/test-repo/tree/master/MongoDB

Use " MongoDB_py.zip " files .

I did the same to convert CSV data to MongoDB dicts.

Please let me know if you have any questions.

Thanks

Gurdyal

Here is sample code for doing something similar to the above using Python pandas. You could also do some aggregation in the dataframe if you wish to summarise the data and get rid of the redundancy.

import pandas as pd
import json

df = pd.read_csv('data.csv')
# one JSON string of records per (Sensor_ID, Sensor_Name) group
grouped = df.groupby(['Sensor_ID', 'Sensor_Name']).apply(
    lambda x: x.drop(['Sensor_ID', 'Sensor_Name'], axis=1)
               .to_json(orient='records'))
grouped.name = 'countList'
# round-trip through to_json to get plain Python objects for pretty printing
js = json.loads(pd.DataFrame(grouped).reset_index().to_json(orient='records'))
print json.dumps(js, indent=4)

The output:

[
    {
        "Sensor_ID": 1, 
        "countList": "[{\"Id\":6,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":37}]", 
        "Sensor_Name": "Bourke Street Mall (North)"
    }, 
    {
        "Sensor_ID": 2, 
        "countList": "[{\"Id\":5,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":28}]", 
        "Sensor_Name": "Bourke Street Mall (South)"
    }, 
    {
        "Sensor_ID": 3, 
        "countList": "[{\"Id\":8,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":155}]", 
        "Sensor_Name": "Melbourne Central"
    }, 
    {
        "Sensor_ID": 4, 
        "countList": "[{\"Id\":1,\"Date_Time\":\" 01-JUN-2009 00:00\",\"Year\":2009,\"Month\":\"June\",\"Mdate\":1,\"Day\":\"Monday\",\"Time\":0,\"Hourly_Counts\":194}]", 
        "Sensor_Name": "Town Hall (West)"
    },
salomonvh
  • I have given that a try, thanks. Do the field names have to be in each field listed, or will it import into MongoDB fine without the field names? – user3504075 Apr 19 '17 at 02:01
  • Hmmm, depending on how you are importing your data, that could be problematic. You could modify the code above to output a nested structure containing the field names. I would say use a `groupby` on your index above and save each subset in a dictionary using the `.to_json(orient='records')` option. I will look into it for you. – salomonvh Apr 19 '17 at 11:31
  • Maybe see this answer: http://stackoverflow.com/questions/24374062/pandas-groupby-to-nested-json – salomonvh Apr 19 '17 at 11:38
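
Along the lines of that linked answer, here is a small variation of the code above (a sketch, untested against the full dataset) that wraps each group's `to_json` output in `json.loads`, so countList comes out as real nested objects rather than escaped strings; it also drops the redundant Id column, as the question asks:

    import pandas as pd
    import json

    df = pd.read_csv('data.csv')
    # parse each group's JSON string back into objects, and drop the
    # redundant Id column while we're at it
    grouped = df.groupby(['Sensor_ID', 'Sensor_Name']).apply(
        lambda x: json.loads(
            x.drop(['Id', 'Sensor_ID', 'Sensor_Name'], axis=1)
             .to_json(orient='records')))
    grouped.name = 'countList'
    js = json.loads(pd.DataFrame(grouped).reset_index().to_json(orient='records'))
    print json.dumps(js, indent=4)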