
I have 16 JSON files, each of which is about 14 GB in size. I've tried the following approach to read them line by line:

import io
import ijson
import pandas as pd

dfObj = pd.DataFrame(columns=["prefix", "type", "value"])

with open(file_name, encoding="UTF-8") as json_file:
    cursor = 0
    for line_number, line in enumerate(json_file):
        print("Processing line", line_number + 1, "at cursor index:", cursor)
        line_as_file = io.StringIO(line)
        # Use a new parser for each line
        json_parser = ijson.parse(line_as_file)
        for prefix, event, value in json_parser:
            dfObj = dfObj.append({"prefix": prefix, "type": event, "value": value}, ignore_index=True)
        cursor += len(line)

My aim is to load them into a pandas data frame to perform some search operations.

The problem is that this approach takes a lot of time to read the file.

Is there a more efficient approach to achieve this?

  • About loading a json into pandas, did you try `pd.read_json(file_name)`? – arnaud Apr 08 '20 at 14:35
  • Yes. That loads the whole file into the memory at once which crashes the kernel. – Yash Kantharia Apr 08 '20 at 14:45
  • Size might be a deterrent. Probably a database would serve you better – sammywemmy Apr 08 '20 at 20:54
  • take a look at mongoDB (stores JSONs) and [pymongo](https://stackoverflow.com/questions/16249736/how-to-import-data-from-mongodb-to-pandas). You can get everything out of the JSONs using field names, conditions, searches, etc. Might want to use Dask instead of pandas (very similar syntax) depending on how much RAM you have (pandas df is like 4x file size). – E. Bassett Apr 09 '20 at 14:46
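
A minimal sketch of the Dask suggestion from that comment, assuming one JSON document per line (`some_field` and `some_value` are hypothetical placeholders, not from the original):

import json
import dask.bag as db

# Read lazily, one JSON document per line, without loading the whole file.
bag = db.read_text(file_name).map(json.loads)
ddf = bag.to_dataframe()  # Dask DataFrame with pandas-like syntax
result = ddf[ddf["some_field"] == "some_value"].compute()  # hypothetical filter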

2 Answers


You can pass json_file directly to ijson.parse just once, instead of reading individual lines out of it. If your files have more than one top-level JSON value, you can use the multiple_value=True option (see here for a description).
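
For example, a minimal sketch of that suggestion (opening the file in binary mode, which the ijson backends accept; `file_name` is the variable from the question):

import ijson

with open(file_name, "rb") as json_file:
    # A single parser over the whole file; multiple_value=True accepts
    # several top-level JSON values (e.g. one JSON document per line).
    for prefix, event, value in ijson.parse(json_file, multiple_value=True):
        ...  # handle each parse event here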

Also make sure you are using an up-to-date ijson, and that the yajl2_c backend is the one in use (in ijson 3 you can see which backend is selected by looking at ijson.backend). For information on backends have a look here.
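
A quick check of the backend in use, assuming ijson 3 as described above:

import ijson

print(ijson.backend)  # e.g. 'yajl2_c' when the fast C backend is active

# To request a specific backend explicitly (requires the yajl C library):
import ijson.backends.yajl2_c as ijson_c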

Rodrigo Tobar
  • Yes, I did use "multiple_value=True" in my code. I will check out the ijson backend. Thanks! – Yash Kantharia Apr 09 '20 at 16:24
  • This surely did increase the speed but not significantly. The size of the data I have is the bottleneck here. Anyway, thank you! – Yash Kantharia Apr 10 '20 at 13:44
  • Another problem that might be dropping performance is how you append data into the DataFrame. Instead of creating it empty and calling `append` you could try a different approach: DataFrame constructor's `data` argument can take an iterable (or so it says [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)), so if you wrap your example code as a generator that yields (prefix, type, value) tuples then you might be able to pass that generator as the data source for the DataFrame. – Rodrigo Tobar Apr 10 '20 at 17:41
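
A minimal sketch of that comment's idea, building the DataFrame in a single pass from a generator instead of calling `append` per row (the helper name `parse_events` is illustrative, not from the original):

import ijson
import pandas as pd

def parse_events(path):
    # Yield (prefix, event, value) tuples from a file with multiple top-level JSON values.
    with open(path, "rb") as f:
        for prefix, event, value in ijson.parse(f, multiple_value=True):
            yield prefix, event, value

df = pd.DataFrame(parse_events(file_name), columns=["prefix", "type", "value"])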

You can use the Pandas built-in function
pandas.read_json()
The documentation is here.
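
Note that `pandas.read_json` can also read newline-delimited JSON in chunks, which avoids the memory problem mentioned in the question comments; a minimal sketch, assuming the files are line-delimited:

import pandas as pd

reader = pd.read_json(file_name, lines=True, chunksize=100_000)
for chunk in reader:
    ...  # each chunk is a DataFrame of up to 100,000 records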

Nirjal Paudel