
I am processing large text files using Python. Each line of a file is a complete JSON message, and might be very long. I need to insert information about each line into a database. This info is very simple: the length of the line plus a unique ID which each message contains. So each line has the form

{"field1":"val1", ..., "ID":"12345", ..., "fieldK":"valK"}

and I need to extract "12345" from the message.

Right now I load the entire string using json.loads() then find the ID and ignore the rest.
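The per-line work is roughly the following (a minimal sketch; insert_into_db is a hypothetical stand-in for the real database call):

```python
import json

def process_line(line):
    msg = json.loads(line)                 # parses the entire message
    insert_into_db(msg["ID"], len(line))   # only the ID and the line length are actually used
```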

My code is too slow and I need to speed it up. I am trying to see if there is a way of extracting "ID" faster than loading the whole string. One option is to search the string for "ID" and then parse the :"12345" that follows. But this could be brittle if the substring "ID" happens to appear somewhere else in the message.

So is there a way of somehow partially loading the line to find ID, which would be as robust as, but also faster than, loading the whole line?

I Z
  • Is each JSON document flat? -- ie, are there any nested lists/dictionaries? – Loren Abrams Jan 28 '13 at 01:25
  • Each line is self-contained and is independent of all other lines. Is this what you're asking? Or are you asking about the structure of each message? – I Z Jan 28 '13 at 01:28
  • The latter. Do the lines (JSON documents) contain any nested lists or dictionaries? – Loren Abrams Jan 28 '13 at 01:30
  • Do you control where the JSON file is created, and is modifying that an option? – Andrew Clark Jan 28 '13 at 01:30
  • @Loren: yes, with nested stuff – I Z Jan 28 '13 at 01:32
  • Could get kind of nasty then. I'd recommend trying to find a streaming JSON parser. I haven't actually had the need for one myself so unfortunately I can't recommend any. – Loren Abrams Jan 28 '13 at 01:35
  • Hi, maybe [this answer](http://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs) is helpful to you: it reads the file as a stream. If you want to speed it up further, a multi-threaded pattern may help. – Joe.wang Jan 28 '13 at 01:39

1 Answer


I would recommend a couple of paths:

If your input is very large, loading it wholly into memory may be wasteful. It may be faster to load and parse each line separately.

If the above doesn't help, then devising some way to search the file for the right ID isn't a bad idea. Just verify that the input is kosher when you actually find the right "ID" value. So you should:

  1. Search (regex or otherwise) for the ID you expect.
  2. For a match, actually parse the line and make sure it's valid. If it isn't (say, the "ID" text is just embedded in some other string), drop it and keep searching.

Since the non-legitimate occurrences caught in step 2 should be rare, the verification doesn't have to be very efficient.
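A minimal sketch of both steps, assuming the ID you are looking for is the quoted string value of a top-level "ID" key (the function name and file handling here are illustrative):

```python
import json
import re

def find_line_with_id(path, wanted_id):
    """Scan a file of JSON lines for the message whose top-level "ID" equals wanted_id."""
    # Cheap textual pre-filter: the exact key/value pair, allowing
    # optional whitespace around the colon.
    pattern = re.compile(r'"ID"\s*:\s*"%s"' % re.escape(wanted_id))
    with open(path) as f:
        for line in f:
            if not pattern.search(line):
                continue  # no textual match, skip without parsing
            # The pattern matched, so pay for a full parse to confirm it is
            # really the top-level ID and not a substring inside some value.
            try:
                msg = json.loads(line)
            except ValueError:
                continue
            if msg.get("ID") == wanted_id:
                return msg
    return None
```

json.loads only runs on the rare lines that pass the cheap regex test, so almost every line is dismissed with a single string search.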

Eli Bendersky