
For my project I have to parse two big JSON files, one 19.7 GB and the other 66.3 GB. The structure of the JSON data is quite complex: the first level is a dictionary, and at the second level there may be lists or dictionaries. These are all network log files, and I have to parse them and do some analysis. Is converting such big JSON files to CSV advisable?

When I try to convert the smaller 19.7 GB JSON file to CSV, it comes out with around 2,000 columns and 0.5 million rows. I am using pandas to parse the data. I have not touched the bigger 66.3 GB file yet. Am I going in the right direction or not? I have no idea how many columns and rows will come out when I convert that bigger file.

Kindly suggest any other good options if they exist. Or is it advisable to read directly from the JSON file and apply OOP concepts to it?

I have already read these articles: article 1 from Stack Overflow and article 2 from Quora

Debashis Sahoo
  • You should probably use C instead of Python for this kind of stuff. – vishal Jul 11 '18 at 06:31
  • @debaonline4u No need to learn a new programming language (C); you can very well do this in Python. We have processed JSON with 20 million keys and much more nesting than yours. Get it into a pandas dataframe first and then you can do any manipulation you want. – min2bro Jul 11 '18 at 06:34
  • Instead of converting it to CSV, I would prefer to use a JSON streamer (a sketch follows these comments). – Raghav Patnecha Jul 11 '18 at 06:34
  • The structure cannot be that complex if it can be converted to CSV. Consider using a binary format which you can [memory map](https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html), a format like HDF5, or a database. – Jan Christoph Terasa Jul 11 '18 at 06:35
  • @debaonline4u, another option we have tried is importing the entire JSON file into MongoDB, which would also be a good alternative, and then accessing it using pymongo. – min2bro Jul 11 '18 at 06:35
  • The best option would be not to create such large JSON files in the first place. Parsing JSON requires a stateful parser and should be done in one pass. When the required RAM is larger than what is available you will have a problem. Some ideas might be found in the duplicate... – Klaus D. Jul 11 '18 at 06:37
  • Possible duplicate of [Is there a memory efficient and fast way to load big json files in python?](https://stackoverflow.com/questions/2400643/is-there-a-memory-efficient-and-fast-way-to-load-big-json-files-in-python) – Klaus D. Jul 11 '18 at 06:38
  • @min2bro Right now my team is working with Python and they are not ready to use any database. Anyway, I'll surely propose this idea too. – Debashis Sahoo Jul 11 '18 at 06:38
  • @serbia99 Someone suggested the same, to use C instead of Python, but for now my team has decided to work with Python. – Debashis Sahoo Jul 11 '18 at 06:42
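
To illustrate the "JSON streamer" suggestion above, here is a minimal sketch using ijson (the commenter did not name a library, so ijson is an assumption here, as are the file name and the `"logs.item"` prefix, which would have to match the file's actual nesting):

```python
import ijson  # streaming JSON parser; assumed choice of library ("pip install ijson")

# Iterate over second-level records without loading the 19.7 GB file at once.
# "network_log.json" is a hypothetical file name; the prefix "logs.item"
# assumes a top-level dict whose "logs" key holds a list of entries.
count = 0
with open("network_log.json", "rb") as f:
    for record in ijson.items(f, "logs.item"):
        count += 1  # replace with real per-record analysis
print(f"parsed {count} records")
```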

1 Answer


You might want to use Dask. It has a syntax similar to pandas, only it is parallel (essentially it is lots of parallel pandas dataframes) and lazy (this helps with avoiding RAM limitations).

You could use the `read_json` method and then do your calculations on the dataframe.
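A minimal sketch of how that could look, assuming the log file is (or has been converted to) line-delimited JSON so that it can be split into blocks; the file name, block size, and output path are hypothetical:

```python
import dask.dataframe as dd

# blocksize splits the line-delimited JSON into many partitions,
# so the whole file is never loaded into memory at once.
df = dd.read_json("network_log.json", lines=True, blocksize=256_000_000)  # ~256 MB blocks

# Operations are lazy: this only builds a task graph.
n_rows = df.shape[0]

# compute() triggers the parallel execution across partitions.
print(n_rows.compute())

# Or stream the flattened data out as one CSV file per partition.
df.to_csv("network_log_csv/part-*.csv", index=False)
```

If the file is a single huge nested object rather than JSON Lines, it would need to be pre-processed or streamed instead (see the comments under the question).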

moshevi
  • How much time will it take to study the Dask library and start working with it? – Debashis Sahoo Jul 11 '18 at 06:45
  • If I read the whole 20 GB of data with the read_json method and convert it to a dataframe, how much memory is it going to consume? – Debashis Sahoo Jul 11 '18 at 06:47
  • For me it was about an hour and a half; they have great notebooks on their [website](https://mybinder.org/v2/gh/dask/dask-examples/master). You can also watch this [video](https://www.youtube.com/watch?v=RA_2qdipVng), which provides a good explanation of the library. – moshevi Jul 11 '18 at 06:50
  • You can specify the `blocksize` (so the JSON will not be in one partition), do the calculations, and then, for example, save to CSV files. Because it is lazy, at no time will the entire file be in memory. – moshevi Jul 11 '18 at 07:08
  • Hi @moshevi, I am currently struggling to load big JSON Lines files in pandas for my experiments. I am playing with the chunksize parameter, and in particular I noticed that the bigger the chunks, the faster the parsing and the greater the memory usage. So far I have reached a good balance with 10k chunks: a decent parsing time and memory usage up to 5.7 times the original file (this proportion holds so far for a 280 MB input file and 4.4 GB of memory). Do you think Dask can do any good here? – Alessandro Benedetti Jul 16 '18 at 11:13
  • Yes! When specifying `chunksize` in pandas you get a generator of pandas `dataframe`s and then you do your calculations on the `dataframe`s consecutively. However, when specifying `blocksize` in `dask` you get a single dask `dataframe` and the calculations are done in parallel (leading to a great performance improvement). If you have a RAM limitation I recommend writing a pipeline (a set of calculations to be done on your dataframe) that saves the final info to disk/database; because the pipeline is lazy up until the saving of the info, at no time will the entire file be in memory. – moshevi Jul 16 '18 at 11:27
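
For comparison, a rough sketch of the two approaches discussed in this thread (file names, sizes, and output paths are hypothetical; both assume line-delimited JSON):

```python
import pandas as pd
import dask.dataframe as dd

# pandas with chunksize: read_json returns an iterator of DataFrames,
# which are processed one after another (sequentially).
for chunk in pd.read_json("network_log.json", lines=True, chunksize=10_000):
    chunk.to_csv("pandas_out.csv", mode="a", header=False, index=False)  # append each chunk

# dask with blocksize: a single lazy dataframe split into partitions,
# processed in parallel; nothing is held in memory until the final save.
df = dd.read_json("network_log.json", lines=True, blocksize=256_000_000)
df.to_csv("dask_out/part-*.csv", index=False)
```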