I have a program that reads a JSON file line by line, aggregates the check-in times into four bins depending on the time of day, and then writes the result to a file. However, my output file contains extra characters that appear when I convert the dictionary to a string and write it out.
For example, this is how the output for one line looks:
dwQEZBFen2GdihLLfWeexA<bound method DataFrame.to_dict of Friday Monday Saturday Sunday Thursday Tuesday Wednesday
Category
Afternoon 0 0 3 2 2 0 1
Evening 20 4 16 11 4 3 5
Night 16 1 19 5 2 5 3>
What looks to me like a memory address (the <bound method DataFrame.to_dict of ...> wrapper) is being written to the output file as well.
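A minimal sketch that reproduces the same kind of output in isolation, assuming a small made-up DataFrame (the name demo and its values are hypothetical, not taken from yelp_checkin.JSON). It compares converting to_dict to a string without parentheses against calling to_dict():

```python
import pandas as pd

# Hypothetical stand-in for the pivot table built in the script.
demo = pd.DataFrame(
    {"Friday": [0, 20], "Monday": [0, 4]},
    index=["Afternoon", "Evening"],
)

# Referencing the method without parentheses yields a bound method object;
# str() of it embeds the whole DataFrame repr inside "<bound method ... >".
unbound_text = str(demo.to_dict)

# Calling the method returns a plain dict keyed by column, then by index.
called = demo.to_dict()

print(unbound_text)
print(called)
```

In this sketch only the first print shows the "bound method" wrapper; the second prints an ordinary nested dictionary.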
Here is the code used for creating this specific file:
import json
import ast
import pandas as pd
from datetime import datetime

def cleanStr4SQL(s):
    return s.replace("'","`").replace("\n"," ")

def parseCheckinData():
    #write code to parse yelp_checkin.JSON
    with open('yelp_checkin.JSON') as f:
        outfile = open('checkin.txt', 'w')
        line = f.readline()
        # print(line)
        count_line = 0
        while line:
            data = json.loads(line)
            # print(data)
            # jsontxt = cleanStr4SQL(str(data['time']))
            # Parse the json and convert to a dictionary object
            jsondict = ast.literal_eval(str(data))
            outfile.write(cleanStr4SQL(str(data['business_id'])))
            # Convert the "time" element in the dictionary to a pandas DataFrame
            df = pd.DataFrame(jsondict['time'])
            # Add a new column "Time" to the DataFrame and set the values after left padding the values in the index
            df['Time'] = df.index.str.rjust(5, '0')
            # Add a new column "Category" and set the values based on the time slot
            df['Category'] = df['Time'].apply(cat)
            # Create a pivot table based on the "Category" column
            pt = df.pivot_table(index='Category', aggfunc=sum, fill_value=0)
            # Convert the pivot table to a dictionary to get the json output you want
            jsonoutput = pt.to_dict
            # print(jsonoutput)
            outfile.write(str(jsonoutput))
            line = f.readline()
            count_line+=1
    print(count_line)
    outfile.close()
    f.close()

# Define a function to convert the time slots to the categories
def cat(time_slot):
    if '06:00' <= time_slot < '12:00':
        return 'Morning'
    elif '12:00' <= time_slot < '17:00':
        return 'Afternoon'
    elif '17:00' <= time_slot < '23:00':
        return 'Evening'
    else:
        return 'Night'
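
As a quick check of the binning logic, here is a minimal sketch assuming the index values are hour strings such as '9:00' or '14:00' (the sample values and the name time_category are hypothetical; the function body mirrors cat). It shows why the values are left-padded to five characters before the string comparisons:

```python
def time_category(time_slot):
    # Same string comparisons as cat(); expects zero-padded 'HH:MM' input.
    if '06:00' <= time_slot < '12:00':
        return 'Morning'
    elif '12:00' <= time_slot < '17:00':
        return 'Afternoon'
    elif '17:00' <= time_slot < '23:00':
        return 'Evening'
    else:
        return 'Night'

# Hypothetical sample slots, not taken from yelp_checkin.JSON.
samples = ['9:00', '14:00', '17:00', '23:00', '0:00']

# Left-pad to five characters, as df.index.str.rjust(5, '0') does.
padded = [s.rjust(5, '0') for s in samples]
categories = {s: time_category(p) for s, p in zip(samples, padded)}

# Without padding, '9:00' sorts after '23:00' as a string,
# so it falls through every range check.
unpadded_result = time_category('9:00')

print(categories)
print(unpadded_result)
```

With padding, '9:00' is binned as 'Morning'; without it, the same slot falls through to 'Night'.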
Is there a way to remove this extra text (what I take to be the memory location) from the output file?
Any advice is appreciated, and please let me know if you need any more information.
Thank you for reading.