0

I have a program which takes in a JSON file, reads it line by line, aggregates the time into four bins depending on the time, and then outputs it to a file. However my file output contains extra characters due to concatenating a dictionary with a string.

For example this is how the output for one line looks:

dwQEZBFen2GdihLLfWeexA<bound method DataFrame.to_dict of            Friday  Monday  Saturday  Sunday  Thursday  Tuesday  Wednesday
Category                                                                 
Afternoon       0       0         3       2         2        0          1
Evening        20       4        16      11         4        3          5
Night          16       1        19       5         2        5          3>

The memory address is being concatenated as well into the output file.

Here is the code used for creating this specific file:

import json
import ast
import pandas as pd
from datetime import datetime

def cleanStr4SQL(s):
    return s.replace("'","`").replace("\n"," ")

def parseCheckinData():
    #write code to parse yelp_checkin.JSON
    # Add a new column "Time" to the DataFrame and set the values after left padding the values in the index

    with open('yelp_checkin.JSON') as f:
        outfile = open('checkin.txt', 'w')
        line = f.readline()
#        print(line)
        count_line = 0
        while line:
            data = json.loads(line)
#            print(data)
#            jsontxt = cleanStr4SQL(str(data['time']))
            # Parse the json and convert to a dictionary object

            jsondict = ast.literal_eval(str(data))
            outfile.write(cleanStr4SQL(str(data['business_id'])))

            # Convert the "time" element in the dictionary to a pandas DataFrame
            df = pd.DataFrame(jsondict['time'])

            # Add a new column "Time" to the DataFrame and set the values after left padding the values in the index
            df['Time'] = df.index.str.rjust(5, '0')

            # Add a new column "Category" and the set the values based on the time slot
            df['Category'] = df['Time'].apply(cat)

            # Create a pivot table based on the "Category" column
            pt = df.pivot_table(index='Category', aggfunc=sum, fill_value=0)

            # Convert the pivot table to a dictionary to get the json output you want
            jsonoutput = pt.to_dict
#            print(jsonoutput)
            outfile.write(str(jsonoutput))

            line = f.readline()
            count_line+=1
    print(count_line)
    outfile.close()
    f.close()

# Define a function to convert the time slots to the categories
def cat(time_slot):
    if '06:00' <= time_slot < '12:00':
        return 'Morning'
    elif '12:00' <= time_slot < '17:00':
        return 'Afternoon'
    elif '17:00' <= time_slot < '23:00':
        return 'Evening'
    else:
        return 'Night'

I was wondering if it was possible to remove the memory location from the output file in some way?

Any advice is appreciated and please let me know if you require any more information.

Thank you for reading

mrsquid
  • 605
  • 2
  • 9
  • 24

2 Answers2

0

Problem 1: missing parenthesis after to_dict, which causes this "memory address".

Problem 2: to produce a valid JSON, you will also need to wrap the output into an array

Problem 3: converting JSON to/from string is not safe with str or eval. Use json.loads() and .dumps()

import json

    ...
    line_chunks = []
    outfile.write("[")
    while line:
        ...
        jsondict = json.loads(data)  # problem 3
        ...
        jsonoutput = pt.to_dict()  # problem 1
        ...
    outfile.write(json.dumps(line_chunks))  # problems 2 and 3
Marat
  • 15,215
  • 2
  • 39
  • 48
  • That's interesting I didn't know that you had to wrap the JSON in an array. Also when I try to do json.loads(data) it gives this error: JSONDecodeError: Expecting property name enclosed in double quotes. For some reason after loading the data it replaces the double quotes for single quotes. – mrsquid Feb 02 '18 at 03:37
  • generally, you don't have to wrap JSON in an array, but here you generate a dictionary structure from every line of input file so if there is more than one line, it'll glue two dicts together. The result (something like `{key: value}{key: value}` is not a valid JSON. So, it needs to be transformed to something like `[{key: value}, {key: value}]` – Marat Feb 02 '18 at 03:40
  • Oh ok that makes sense thanks for the information and help with the code! I didn't realize there were other errors too. – mrsquid Feb 02 '18 at 03:46
0

The way you're working with JSON seems like streaming it, which is an unpleasant problem to deal with.

If you're not working with a terribly big JSON file, you're better off with

with open("input.json", "r") as input_json: json_data = json.load(input_json)

And then extract specific entries from json_data as you wish (just remember it is a dictionary), manipulate them and populate an output dict intended to be saved

Also, in python if you're using a with open(...) syntax, you don't need to close the file afterwards

formi23
  • 616
  • 7
  • 10