I am attempting to convert a very large JSON file to CSV. I have been able to convert a small file of this type to a 10-record (for example) CSV file. However, when trying to convert a large file (on the order of 50,000 rows in the CSV file) it does not work. The data was created by a curl command with -o pointing to the JSON file to be created, and the output file has no newline characters in it. The CSV file will be written with csv.DictWriter(), and (where data is the loaded JSON input) the row and column counts have the form
rowcount = len(data['MainKey'])
colcount = len(data['MainKey'][0]['Fields'])
I then loop through the range of the rows and columns to get the CSV dictionary entries
csvkey = data['MainKey'][recno]['Fields'][colno]['name']
csvval = data['MainKey'][recno]['Fields'][colno]['Values']['value']
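To be concrete, this is roughly what the working small-file version looks like ('smallfile.json' and 'output.csv' are placeholder names, and the key names are the ones from my snippets above):

import csv
import json

# Roughly what the working small-file version does (simplified):
with open('smallfile.json', 'r') as infile:
    data = json.load(infile)   # fine for the small file

rows = []
for recno in range(len(data['MainKey'])):
    row = {}
    for colno in range(len(data['MainKey'][recno]['Fields'])):
        field = data['MainKey'][recno]['Fields'][colno]
        row[field['name']] = field['Values']['value']
    rows.append(row)

with open('output.csv', 'wb') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=sorted(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)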
I attempted to use the answers from other questions, but they did not work with a big file (du -m bigfile.json reports 157 MB), and the files that I want to handle are even larger.
An attempt to get the size of each line with
myfile = open('file.json', 'r')
line = myfile.readline()
print len(line)
shows that this reads the entire file as a single string, since there are no newlines to split on. Thus, one small file shows a length of 67744, while a larger file shows 163815116.
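Since there are no newlines to read by, the only way I can see to read the file piecewise is in fixed-size chunks, something like the following (the chunk size here is arbitrary):

with open('bigfile.json', 'r') as myfile:
    while True:
        chunk = myfile.read(65536)   # arbitrary chunk size; no newlines to split on
        if not chunk:
            break
        print len(chunk)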
An attempt to read the data directly with
data = json.load(infile)
gives the error that other questions have discussed for large files.
An attempt to use the generator
def json_parse(self, fileobj, decoder=JSONDecoder(), buffersize=2048):
    ...
    yield results
shown in another answer works with a 72 KB file (10 rows, 22 columns), but seems to either lock up or take an interminable amount of time on an intermediate-sized file of 157 MB (again from du -m bigfile.json).
Note that a debug print shows that each chunk is 2048 bytes, as specified by the default argument. It appears to be working through the entire 163815116 characters (the len shown above) in 2048-byte chunks. If I change the chunk size to 32768, simple math shows that it would take about 5,000 passes through the loop to process the file.
Changing to a chunk size of 524288 exits the processing loop approximately every 11 chunks, but should still take approximately 312 chunks to process the entire file.
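For reference, the generator from that answer is roughly the following (reconstructed here without the self argument so the question is self-contained). My guess is that, because my whole file is a single top-level object, raw_decode() cannot succeed until the entire file has been accumulated in buffer, and the repeated failed parses over an ever-growing string are what take so long:

from json import JSONDecoder
from functools import partial

def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    # accumulate chunks until raw_decode() can pull out a complete JSON value
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # not enough data to decode yet; read another chunk
                break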
If I could get it to stop at the end of each row item, I would be able to process that row and send it to the CSV file based on the form shown below (see the sketch after the structure).
vi on the small file shows that it is of the form
{"MainKey":[{"Fields":[{"Value": {'value':val}, 'name':'valname'}, {'Value': {'value':val}, 'name':'valname'}}], (other keys)},{'Fields' ... }] (other keys on MainKey level) }
I cannot use ijson, as I must set this up on systems where I cannot install additional software.