
I have a stream of strings where I need to analyze each one and check whether it is valid JSON. The pythonic way (EAFP) dictates something like this:

import json
def parse_json(string):
    try:
        return json.loads(string)
    except ValueError:  # also covers json.JSONDecodeError
        return string

The problem is that a significant number of strings are not JSONs, and the many exceptions raised by this code slow the process quite a bit.

I am looking for some way to try and parse the text as JSON, returning some kind of pre-defined value (e.g. an empty tuple()) indicating the string is not JSON compatible. I don't mind hacking around the standard json package (overriding a function or two..) if this is the easiest solution.

Any suggestions?

Update: As I'm only interested in 'complex' JSONs (arrays and objects), I've eventually decided to go with a simple if to check the first and last characters of the string:

try:
    import ujson as json
except ImportError:
    import json


def parse_json(text):
    if len(text) > 0:
        text = text.strip()
        if text != "" and ((text[0] == "{" and text[-1] == "}") or (text[0] == "[" and text[-1] == "]")):
            try:
                return json.loads(text)
            except ValueError:
                return text
    return text

ujson is a much more efficient implementation than Python's standard json library. Additionally, skipping all strings which are not wrapped in [] or {} reduces the number of exceptions by a large factor. It turns out that mixing LBYL and EAFP was what I needed.
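To illustrate the difference, here is a rough micro-benchmark sketch (not from the original post; the function names are made up) comparing the pure-EAFP version with the mixed LBYL/EAFP version on a non-JSON string:

```python
import json
import timeit

def parse_eafp(text):
    # Pure EAFP: always attempt the parse and pay for the exception on failure.
    try:
        return json.loads(text)
    except ValueError:
        return text

def parse_mixed(text):
    # LBYL pre-check: only attempt the parse when the string looks like a
    # JSON array or object; everything else is returned unchanged.
    stripped = text.strip()
    if stripped and ((stripped[0] == "{" and stripped[-1] == "}")
                     or (stripped[0] == "[" and stripped[-1] == "]")):
        try:
            return json.loads(stripped)
        except ValueError:
            return text
    return text

non_json = "just a plain log line, definitely not JSON"
print(timeit.timeit(lambda: parse_eafp(non_json), number=100_000))
print(timeit.timeit(lambda: parse_mixed(non_json), number=100_000))
```

On strings that fail the character check, the mixed version skips both the parse attempt and the exception machinery, which is where the savings come from.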

redlus
    I doubt very much that the overhead is in catching the exception, but rather in trying to parse the string in the first place. – Daniel Roseman Mar 01 '17 at 19:44
  • As @DanielRoseman points out, try/except handling is relatively cheap in Python. Checking if a string is valid JSON and then converting it is likely to be slower than what you're doing. However, if the invalid ones are all broken in the same way, you _may_ be able to avoid trying to convert them with `loads()` if a way of checking for that condition is extremely quick. – martineau Mar 01 '17 at 20:34
  • @DanielRoseman a try-except clause is practically free when no exceptions are raised, but costly otherwise (e.g. http://stackoverflow.com/a/2522013/4369617). As I have to process hundreds of millions of strings per day, this becomes a burden. – redlus Mar 07 '17 at 17:16
  • @martineau interesting approach. I'll analyze the most common exception causes and check for them accordingly – redlus Mar 07 '17 at 17:17

2 Answers


A more compact way would be something like this. The json library only accepts str, bytes, or bytearray inputs, so only those need to be considered. Instead of `if len(text) == 0`, `if not text` is faster and more idiomatic; we don't actually need the length of the text. json.loads may raise a JSONDecodeError. The first and last characters could be matched more strictly (e.g. with a regex back-reference so that '{' must pair with '}'), but I tried possible edge cases such as '{]' and '[}': they pass the character check yet are still rejected by loads, so the function does not fail.

def is_json(text: str) -> bool:
    from json import loads, JSONDecodeError
 
    if not isinstance(text, (str, bytes, bytearray)):
        return False
    if not text:
        return False
    text = text.strip()
    if text[0] in {'{', '['} and text[-1] in {'}', ']'}:
        try:
            loads(text)
        except (ValueError, TypeError, JSONDecodeError):
            return False
        else:
            return True
    else:
        return False

EDIT: We should also check whether text is empty after stripping, so as not to raise an IndexError when indexing into it.

from json import loads, JSONDecodeError


def is_json(text: str) -> bool:
    if not isinstance(text, (str, bytes, bytearray)):
        return False
    if not text:
        return False
    text = text.strip()
    if text:
        if text[0] in {'{', '['} and text[-1] in {'}', ']'}:
            try:
                loads(text)
            except (ValueError, TypeError, JSONDecodeError):
                return False
            else:
                return True
        else:
            return False
    return False

For the speed issue, you can use multiprocessing to speed up the parsing. For example, with 5 workers you can parse 5 strings at a time instead of one.
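A minimal sketch of that idea (not from the original question; the worker count and inputs are arbitrary), using multiprocessing.Pool:

```python
import json
from multiprocessing import Pool

def parse_json(string):
    # Return the parsed value, or the original string if it isn't JSON.
    try:
        return json.loads(string)
    except ValueError:
        return string

if __name__ == "__main__":
    strings = ['{"a": 1}', "not json", "[1, 2, 3]"]
    # Five workers parse up to five strings concurrently.
    with Pool(5) as pool:
        results = pool.map(parse_json, strings)
    print(results)
```

Note that this parallelizes the work but does not reduce the per-string cost of a raised exception.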

If I understand the second part of the question correctly, you can do something like this to get a custom error message for the different errors raised by json:

def parse_json(string):
    try:
        return json.loads(string)
    except json.JSONDecodeError as ex1:  # the input is not valid JSON
        print(ex1)
        return string
    except TypeError as ex2:  # the input is not str/bytes/bytearray
        print(ex2)
        return string
    # ...more except clauses as needed

Or, if you don't want to see anything when an error is raised, just give it a pass to do nothing:

def parse_json(string):
    try:
        return json.loads(string)
    except json.JSONDecodeError:
        pass
ali
  • multi/distributed processing is an option, but it does not reduce the extra processing time of exceptions, which I want to offload. Ideally, I'd like the json.loads() function to return a pre-defined value when the parsing fails instead of raising an exception. – redlus Mar 07 '17 at 17:40
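A small wrapper along the lines of that comment (a sketch, not from the original answers; the name `loads_or_default` is made up) that returns a pre-defined value instead of raising:

```python
import json

_NOT_JSON = object()  # unique sentinel; any pre-defined value works

def loads_or_default(text, default=_NOT_JSON):
    """Like json.loads, but returns `default` instead of raising on bad input."""
    try:
        return json.loads(text)
    except (ValueError, TypeError):  # invalid JSON, or not a str/bytes input
        return default

print(loads_or_default("not json", default=()))  # → ()
print(loads_or_default('[1, 2]'))                # → [1, 2]
```

The exception is still raised and caught internally, so this is purely a convenience; it does not remove the exception overhead that the pre-check in the question's update avoids.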