
I am attempting to convert a very large JSON file to CSV. I have been able to convert a small file of this type to a 10-record (for example) CSV file. However, when trying to convert a large file (on the order of 50000 rows in the CSV file) it does not work. The data was created by a curl command with -o pointing to the JSON file to be created, and the output file has no newline characters in it. The CSV file will be written with csv.DictWriter(), and (where data is the loaded JSON input) the counts have the form

rowcount = len(data['MainKey'])
colcount = len(data['MainKey'][0]['Fields'])

I then loop through the range of the rows and columns to get the csv dictionary entries

csvkey = data['MainKey'][recno]['Fields'][colno]['name']
csvval = data['MainKey'][recno]['Fields'][colno]['Values']['value']
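
For the small files, the conversion looks roughly like this (a sketch; the file names, the header handling, and the Python 2 style 'wb' mode are illustrative, not the exact code):

import csv
import json

# Load the whole (small) file and derive the row/column counts
with open('file.json', 'r') as infile:
    data = json.load(infile)

rowcount = len(data['MainKey'])
colcount = len(data['MainKey'][0]['Fields'])

# Column headers taken from the 'name' entries of the first record
fields = [data['MainKey'][0]['Fields'][col]['name'] for col in range(colcount)]

with open('file.csv', 'wb') as outfile:   # use 'w', newline='' on Python 3
    writer = csv.DictWriter(outfile, fieldnames=fields)
    writer.writeheader()
    for recno in range(rowcount):
        row = {}
        for colno in range(colcount):
            csvkey = data['MainKey'][recno]['Fields'][colno]['name']
            csvval = data['MainKey'][recno]['Fields'][colno]['Values']['value']
            row[csvkey] = csvval
        writer.writerow(row)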

I attempted to use the answers from other questions, but they did not work with a big file (du -m bigfile.json reports 157 MB), and the files that I want to handle are even larger.

An attempt to get the size of each line with

myfile = open('file.json', 'r')
line = myfile.readline()
print len(line)

shows that this reads the entire file as a single string, since there are no newlines. Thus, one small file shows a length of 67744, while a larger file shows 163815116.

An attempt to read the data directly from

data = json.load(infile)

gives the error that other questions have discussed for large files.

An attempt to use the

def json_parse(self, fileobj, decoder=JSONDecoder(), buffersize=2048):
    ...
    yield results

as shown in another answer works with a 72 KB file (10 rows, 22 columns), but it seems to either lock up or take an interminable amount of time on an intermediate-sized file of 157 MB (from du -m bigfile.json).
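
The generator from that answer is built around JSONDecoder.raw_decode. Roughly, it looks like the sketch below (not the linked answer verbatim, and shown as a plain function rather than a method):

from functools import partial
from json import JSONDecoder

def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    # Read the file in buffersize chunks until EOF (read() returns '')
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                # Pull one complete JSON object off the front of the buffer
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data in the buffer for a complete object;
                # go read another chunk.
                break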

Note that a debug print shows that each chunk is 2048 bytes, as specified by the default input argument. It appears that it is trying to work through the entire 163815116 bytes (the length shown above) in 2048-byte chunks. If I change the chunk size to 32768, simple math shows that it would take about 5,000 passes through the loop to process the file.

A change to a chunk size of 524288 exits the loop approximately every 11 chunks, but it should still take approximately 312 chunks to process the entire file.

If I can get it to stop at the end of each row item, I would be able to process that row and send it to the csv file based on the form shown below.

vi on the small file shows that it is of the form

{"MainKey":[{"Fields":[{"Value": {'value':val}, 'name':'valname'}, {'Value': {'value':val}, 'name':'valname'}}], (other keys)},{'Fields' ... }] (other keys on MainKey level) }

I cannot use ijson, as I must set this up on systems on which I cannot install additional software.

sabbahillel
    `open('file.json','r').readline()` returns *one line*, the `for` loop then iterates over the individual characters of that one line. – Martijn Pieters Feb 12 '14 at 14:53
  • @MartijnPieters said it! – brunsgaard Feb 12 '14 at 14:54
  • Without access to your `bigfile.json` it is impossible for us to tell why my previous answer is not working for you. How big are the individual, distinct JSON entries in your file? Perhaps you should add a `print` statement to my function to show how large the buffer is getting? Right after `buffer += chunk` add a `print('buffer size now', len(buffer))` to see how much is being read from the file as it 'hangs'. – Martijn Pieters Feb 12 '14 at 14:55
  • Thank you for pointing out the chunk size problem. It appears that python is able to handle 1000 rows at a time using a chunk size of 8388606, but I would still want to find out a more efficient way of handling this if possible. – sabbahillel Feb 12 '14 at 18:06

1 Answer


I wound up using a chunk size of 8388608 (0x800000) in order to process the files. I then processed each object that had been read in as part of the loop, keeping count of rows processed and rows discarded. At each call of the processing function, I added its counts to the totals so that I could keep track of the total records processed.

This appears to be the way that it needs to go.

If a question like this comes up again, please emphasize that a large chunk size must be specified, not the 2048 shown in the original answer.

The loop goes

first = True
for data in self.json_parse(inf):
  records = len(data['MainKey'])
  columns = len(data['MainKey'][0]['Fields'])
  if first:
    # Initialize output as DictWriter
    ofile, outf, fields = self.init_csv(csvname, data, records, columns)
    first = False
  reccount, errcount = self.parse_records(outf, data, fields, records)
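
The init_csv routine is not shown above; one possible shape for it, assuming it only builds the field list from the first record and wraps the output file in a csv.DictWriter, would be:

import csv

# Hypothetical init_csv; the actual implementation is not shown in the answer.
# Builds the header list from the first record, opens the csv file and
# returns the open file, the DictWriter and the field names.
def init_csv(self, csvname, data, records, columns):
    fields = [data['MainKey'][0]['Fields'][col]['name']
              for col in range(columns)]
    ofile = open(csvname, 'wb')            # use 'w', newline='' on Python 3
    outf = csv.DictWriter(ofile, fieldnames=fields)
    outf.writeheader()
    return ofile, outf, fields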

Within the parsing routine

for rec in range(records):
  currec = data['MainKey'][rec]
  # If each column count can be different
  columns = len(currec['Fields'])
  retval, valrec = self.build_csv_row(currec, columns, fields)

To parse the columns use

for col in range(columns):
  dataname = currec['Fields'][col]['name']
  dataval = currec['Fields'][col]['Values']['value']
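
build_csv_row is likewise not shown; a sketch that matches the calls above (hypothetical, not the actual implementation) would collect the name/value pairs into the row dictionary and flag records whose columns do not match the header:

# Hypothetical build_csv_row; the return values follow the call shown above.
def build_csv_row(self, currec, columns, fields):
    valrec = {}
    for col in range(columns):
        dataname = currec['Fields'][col]['name']
        dataval = currec['Fields'][col]['Values']['value']
        if dataname not in fields:
            return False, {}        # unexpected column name: discard record
        valrec[dataname] = dataval
    return True, valrec

parse_records would then presumably call outf.writerow(valrec) when retval is true and add to reccount, adding to errcount otherwise, which is where the processed/discarded counts described above come from.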

Thus the references now work and the processing is handled correctly. The large chunk size apparently allows the processing to be fast enough to handle the data while being small enough not to overload the system.

sabbahillel