
I have a list of ugly-looking JSON objects in a text file, one per line. I would like to pretty-print them and send the results to a file.

My attempt uses the command-line interface to Python's json.tool:

parallel python -mjson.tool < jsonList

However, something seems to be going wrong in how the JSON is handed off: python's json.tool treats each line as a file name to open, and thus throws:

IOError: [Errno 2] No such file or directory: {line contents, which contain single quotes, spaces, double quotes}

How can I compel this to treat each newline-separated object as a single input to the module? Opening the file directly in Python and processing it serially is an inefficient solution because the file is enormous; attempting to do so pegs the CPU.

argentage

3 Answers


Well, the json module already has something like what you have in mind.

>>> import json
>>>
>>> my_json = '["cheese", {"cake":["coke", null, 160, 2]}]'
>>> parsed = json.loads(my_json)
>>> print json.dumps(parsed, indent=4, sort_keys=True)
[
    "cheese", 
    {
        "cake": [
            "coke", 
            null, 
            160, 
            2
        ]
    }
]

And you can just read my_json from a text file by opening it with open() in 'r' mode.
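For a file with one JSON object per line, a minimal sketch of that approach (file names here are placeholders) reads the file line by line, so the whole thing never has to sit in memory at once:

import json

# Placeholder file names; point these at your own input and output.
with open('jsonList', 'r') as infile, open('prettyList', 'w') as outfile:
    for line in infile:           # iterate line by line, not all at once
        line = line.strip()
        if not line:              # skip blank lines
            continue
        parsed = json.loads(line)
        json.dump(parsed, outfile, indent=4, sort_keys=True)
        outfile.write('\n')       # keep consecutive objects separated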

Games Brainiac
  • I avoided using the open command directly in a python script because I don't know what python will do with a 2 GB text file. – argentage Aug 22 '13 at 18:44
  • Take a look @ this: http://stackoverflow.com/questions/7134338/max-size-of-a-file-python-can-open – Games Brainiac Aug 22 '13 at 18:47
  • @airza: "open" a file doesn't mean "load entire file into memory". Just iterate it line by line (`for line in file`) and do the conversion as shown. It will work no matter how big the file is. – georg Aug 22 '13 at 18:49
  • This answer works, but it pegs the CPU, which is why I am trying to parallelize the process on the machine's 16 cores. Perhaps it will not be faster than the simple answer, but I would like to find out- hence asking the question that I asked. – argentage Aug 22 '13 at 18:57
  • @airza: Well, it's a 2 GB file; regardless of what you use, it will take time. If you need a fast implementation, then why not use PyPy? It's much faster than regular Python. However, I doubt there will be too much of a change, since open() is written directly in C, and so I really do not think it can get faster than this. – Games Brainiac Aug 22 '13 at 19:05

GNU Parallel will by default put the input as arguments on the command line. So what you are actually running is:

python -mjson.tool \[\"cheese\",\ \{\"cake\":\[\"coke\",\ null,\ 160,\ 2\]\}\]

But what you want is:

echo \[\"cheese\",\ \{\"cake\":\[\"coke\",\ null,\ 160,\ 2\]\}\] | python -mjson.tool

GNU Parallel can do that with --pipe -N1:

parallel -N1 --pipe python -mjson.tool < jsonList
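If you also want the result in a file, as the question asks, just redirect the output. Something along these lines should do it (prettyList is only a placeholder name; -k, i.e. --keep-order, is assumed here so the pretty-printed objects come out in the same order as the input lines):

parallel -k -N1 --pipe python -mjson.tool < jsonList > prettyList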

10-second installation:

wget -O - pi.dk/3 | bash

Watch the intro video for a quick introduction: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial (man parallel_tutorial). Your command line will love you for it.

Ole Tange

Two problems with my approach, which I eventually solved:

The default parallelization will spawn a new Python VM for each job, which is... slow. So slow.

The default json.tool does the naive implementation, but it expects at most a single input file argument, so it gets confused by the number of incoming arguments.

I wrote this:

import sys
import json

# Each command-line argument is one JSON object passed in by parallel;
# pretty-print them all to stdout.
for i in sys.argv[1:]:
    o = json.loads(i)
    json.dump(o, sys.stdout, indent=4, separators=(',', ': '))
    sys.stdout.write('\n')  # keep consecutive objects separated

Then called it like this:

parallel -n 500 python fastProcess.py < filein > prettyfileout

I'm not quite sure of the optimal value for -n, but the script is 4-5x faster in wall-clock time than the naive implementation because it can use multiple cores.

argentage