
I am processing large text files using Python. Each line of a file is a complete JSON message, and might be very long. I need to insert information about each line into a database. This info is very simple: the length of the line plus a unique ID which each message contains. So each line has the form

{"field1":"val1", ..., "ID":"12345", ..., "fieldK":"valK"}

and I need to extract "12345" from the message.

Right now I load the entire string using json.loads() then find the ID and ignore the rest.
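The per-line work is roughly the following (a minimal sketch; insert_into_db is a hypothetical stand-in for the real database call):

```python
import json

def process_line(line):
    msg = json.loads(line)                 # parses the entire message
    insert_into_db(msg["ID"], len(line))   # only the ID and the line length are actually used
```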

My code is too slow and I need to speed it up. I am trying to see if there is a way of extracting "ID" faster than loading the whole string. One option is to search the string for "ID" and then parse the :"12345" that follows. But this could be brittle if the substring "ID" happens to appear somewhere else in the message.

So is there a way of somehow partially loading the line to find ID, which would be as robust as, but also faster than, loading the whole line?

I Z
  • Is each JSON document flat? -- ie, are there any nested lists/dictionaries? – Loren Abrams Jan 28 '13 at 01:25
  • Each line is self-contained and is independent of all other lines. Is this what you're asking? Or are you asking about the structure of each message? – I Z Jan 28 '13 at 01:28
  • The latter. Do the lines (JSON documents) contain any nested lists or dictionaries? – Loren Abrams Jan 28 '13 at 01:30
  • Do you control where the JSON file is created, and is modifying that an option? – Andrew Clark Jan 28 '13 at 01:30
  • @Loren: yes, with nested stuff – I Z Jan 28 '13 at 01:32
  • Could get kind of nasty then. I'd recommend trying to find a streaming JSON parser. I haven't actually had the need for one myself so unfortunately I can't recommend any. – Loren Abrams Jan 28 '13 at 01:35
  • Hi, maybe [this answer](http://stackoverflow.com/questions/12485718/python-read-file-as-stream-from-hdfs) is helpful to you: it reads the file as a stream. If you want to speed it up further, a multi-threaded pattern may help. – Joe.wang Jan 28 '13 at 01:39

1 Answer


I would recommend a couple of paths:

If your input is very large, loading it wholly into memory may be wasteful. It may be faster to load and parse each line separately.

If the above doesn't help, then devising some way to search the file for the right ID isn't a bad idea. Just verify that the input is kosher when you actually find the right "ID" value. So you should:

  1. Search (regex or otherwise) for the ID you expect.
  2. For a match, actually parse the line and make sure it's valid. If it isn't (say, the "ID" text is just embedded in some other string), drop it and keep searching.

Since the non-legitimate occurrences caught in step 2 should be rare, the verification doesn't have to be very efficient.
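A minimal sketch of both steps, assuming the ID you are looking for is the quoted string value of a top-level "ID" key (the function name and file handling here are illustrative):

```python
import json
import re

def find_line_with_id(path, wanted_id):
    """Scan a file of JSON lines for the message whose top-level "ID" equals wanted_id."""
    # Cheap textual pre-filter: the exact key/value pair, allowing
    # optional whitespace around the colon.
    pattern = re.compile(r'"ID"\s*:\s*"%s"' % re.escape(wanted_id))
    with open(path) as f:
        for line in f:
            if not pattern.search(line):
                continue  # no textual match, skip without parsing
            # The pattern matched, so pay for a full parse to confirm it is
            # really the top-level ID and not a substring inside some value.
            try:
                msg = json.loads(line)
            except ValueError:
                continue
            if msg.get("ID") == wanted_id:
                return msg
    return None
```

json.loads only runs on the rare lines that pass the cheap regex test, so almost every line is dismissed with a single string search.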

Eli Bendersky