0

I have a large CSV file that we will be using to import assets into our asset management database. Here is a smaller example for the CSV data.

Serial number,Movement type,Posting date
2LMXK1,101,1/5/15 9:00
2LMXK1,102,1/5/15 9:30
2LMXK1,201,1/5/15 10:30
2LMXK1,202,1/5/15 13:00
2LMXK1,301,1/5/15 14:00
JEMLP3,101,1/6/15 9:00
JEMLP3,102,1/7/15 10:00
JEMLP3,201,1/7/15 13:30
JEMLP3,202,1/7/15 15:30
JEMLP3,203,1/7/15 17:30
BR83GP,101,1/5/15 9:00
BR83GP,102,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,202,1/7/15 15:30
BR83GP,301,1/5/15 13:00
BR83GP,201,1/6/15 9:00
BR83GP,301,1/9/15 15:30

I need to parse this CSV and return a JSON object of only the latest movement type for each unique serial number.

I have a script successfully achieves: * Parse the CSV * Sort by Date and group by Serial number, get the latest date * Convert panda dataframe to JSON object (but missing serialnumber)

ISSUE: The "serial number" column is omitted when converting the dataframe to a JSON object. I think the problem is the fact that "serial number" is used as the groupby value. I tried the builtin to_json but that did not return the data in the correct format.

The data frame contains the correct filtered data that I need as we can see in print(last_movement).

How can I create a JSON object and obtain all of the columns in the dataframe?

#Import python modules
import ujson as json
import pandas as pd
import numpy as np

#CSV parse to panda dataframe
pdata = pd.read_csv('import.csv', skipinitialspace=True, parse_dates=[2])

#Sort by posting date to get rows by latest posting date/time only
last_movement = pdata.sort_values('Posting date').groupby('Serial number').last()

print(last_movement)
# RETURNS
# We know the dataframe is correct
#
#                      Movement type        Posting date 
#Serial number                                   
#2LMXK1                   301 2015-01-05 14:00:00
#BR83GP                   301 2015-01-09 15:30:00
#JEMLP3                   203 2015-01-07 17:30:00

out = last_movement.to_json()
print(out)

#RETURNS a JSON object that is aggregated by serial number
# {"Movement type":{"2LMXK1":301,"BR83GP":301,"JEMLP3":203},"Posting date":{"2LMXK1":1420466400000,"BR83GP":1420817400000,"JEMLP3":1420651800000}}

Here is the output when I tried a custom function to iterate the values and convert the dataframe to JSON object. Although this is a little bit better, it still does not have the serial number. It appears as though the "groupby" aggregation is causing some issues with the serial number column. Perhaps I need to somehow "ungroup" the resulting dataframe so I have my filtered data and can convert it to a JSON object.

#Convert panda dataframe to json object
def tojson(df):
    d = [ 
        dict([
            (colname, row[i])
            for i,colname in enumerate(df.columns)
        ])
        for row in df.values
    ]
    return json.dumps(d)

out = tojson(last_movement)

print(out)

# RETURNS
# MISSING SERIAL NUMBER
# [{"Posting date":1420466400,"Movement type":301},{"Posting date":1420817400,"Movement type":301},{"Posting date":1420651800,"Movement type":203}]
pengz
  • 2,279
  • 3
  • 48
  • 91

1 Answers1

0

I have located the answer. Set as_index=False in the groupby param. The JSON object is in the correct format and includes the serial number with this change.

Converting a Pandas GroupBy object to DataFrame

Aggregation functions will not return the groups that you are aggregating over if they are named columns, when as_index=True, the default. The grouped columns will be the indices of the returned object.

Passing as_index=False will return the groups that you are aggregating over, if they are named columns.

Community
  • 1
  • 1
pengz
  • 2,279
  • 3
  • 48
  • 91