
I'm pulling tweets via the Twitter Streaming API and tokenizing the text. I want to store everything. My current code looks like this:

import cjson  # third-party python-cjson package
with open("tweets.json", "a") as f:  # "a" appends to the file on every run
    f.write(cjson.encode(tweets))

But then, when I try to decode the very same file, I get errors! It's Twitter, so the actual content is all over the place: URLs, Unicode, etc. Is there an equivalent of, e.g., re.escape for JSON? I don't know enough about JSON to write something that escapes every potential fly in the ointment, nor do I want to spend the time. I read about the strict parameter, but I'm not sure that's enough.
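For comparison, here is a minimal sketch of the same round trip using only the standard-library json module. It writes one JSON document per line, so appending never leaves two top-level values glued together in one file. The helper names `append_batch` and `read_batches` and the one-line-per-batch layout are assumptions made for illustration, not part of the code above.

```python
import io
import json


def append_batch(path, tweets):
    # Hypothetical helper: serialize one batch of tokenized tweets as a
    # single line of JSON. With the default ensure_ascii=True, non-ASCII
    # characters are written as \uXXXX escapes.
    with io.open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(tweets) + u"\n")


def read_batches(path):
    # Hypothetical helper: decode every non-empty line as its own document.
    with io.open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Each call to `append_batch` adds one line, and `read_batches` returns a list with one entry per appended batch.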

ETA: Here's the sample data everyone's been clamoring for. Sorry I was vague:

[["Just", "hanging", "with", "my", "cousins", "#tbt", "#adorable", "#grandmashouse", "@jess_lufrano", "@gabalvarezxo", "@robbybacs", "http://t.co/wgDntda7WB"], ["going", "to", "do", "things.", "Horrible", "things.", "Things", "done", "only", "in", "nightmares.", ">:>", "#muhahaha"], ["#truelove", "http://t.co/fEfT797Xit"], ["IMG_5667:", "Savini", "Francesco", "has", "added", "a", "photo", "to", "the", "pool:", "", "http://t.co/XYFsFIHG3M", "#national", "#pics"], ["I", "would", "rather", "11", "million", "Romanians", "and", "Bulgarians", "in", "Bromsgrove", "than", "one", "Sajid", "Javid", "#bbcqt"], ["lol", "Fuck", "around", "been", "the", "midgets!", "#OH", "#NO"], ["TODAY's", "SHOW:", "@markMGgeyer", "&", "@GusWorland's", "trip", "to", "Gallipoli", "on", "#anzacday", "+", "Sad", "revelations", "about", "Jon", "Mannah", "+", "Ray", "Martin."], ["Using", "valued", "objects", "for", "currency", "is", "fascinating.", "I", "want", "to", "see", "that", "really", "explored.", "#doctorwho"], ["@KevinMallonTri", "ya", "buddy!", "You", "know", "I'm", "ready..I", "leave", "tomorrow.#ready2Race"], ["My", "mom", "has", "two", "different", "lights", "with", "two", "different", "colour", "temps", "and", "it", "bugs", "me.", "I", "think", "there", "is", "something", "wrong", "with", "me.", "#filmkidproblems"], ["#Golf", "#PGA", "Quail", "Hollow", "bullish", "despite", "greens,", "no", "Tiger", "Woods", "-", "Charlotte", "Business", "Journal...", "http://t.co/UWn98AwpGT", "#MustFollow", "TWNews"], ["So", "what's", "the", "next", "#jam", "theme?"], ["#Me", "&", "my", "#homegirl", "solange", "#throwback", "#tbt", "#picoftheday", "#photo", "#instapic", "#instabomb", "#years", "#ago", "#boat\u2026", "http://t.co/86X0A2xRDa"],...

(NB: I truncated the sample, but I double-checked and it ends with ]], like I'm pretty sure it should. Again, I'm not exactly Cap'n Json.)

And the error:

>>> decoder.decode(open("tweets.json").read())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python/2.7.3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 3243 (char 3243)
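The traceback comes from the standard-library json decoder, which raises this ValueError for any backslash sequence outside the set JSON allows (\" \\ \/ \b \f \n \r \t and \uXXXX). Below is a minimal sketch that reproduces the same kind of error and pulls out the text around a reported offset. The helper name `context_at` and the sample strings are made up for illustration, and whether the file here actually contains a Python-style escape such as \x is an assumption.

```python
import json


def context_at(text, pos, width=20):
    # Hypothetical helper: return the characters surrounding offset `pos`.
    return text[max(0, pos - width):pos + width]


bad = r'["caf\xe9"]'     # \x.. is a Python escape, not a JSON one
good = r'["caf\u00e9"]'  # the JSON spelling of the same character

try:
    json.loads(bad)
    error = None
except ValueError as exc:  # JSONDecodeError subclasses ValueError on Python 3
    error = str(exc)

decoded = json.loads(good)
```

Something like `context_at(open("tweets.json").read(), 3243)` would show what sits at the offset named in the error, which should make the offending escape visible.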

Actually, while we're here: is there any difference between the various Python JSON libraries (simplejson/json, cjson, ujson, etc.) with respect to this kind of thing? Are any of them "escapier" on the encoding end or more flexible on the decoding end? I'm not too concerned with speed, just with not-hassles.

swizzard
  • It would be helpful if you could pin down which strings/objects aren't being serialized properly. In other words, create input so that we can reproduce the problem. (Also, what is `cjson`? I'm familiar with `json` in the standard library -- is `cjson` a 3rd party extension?) – mgilson Apr 26 '13 at 02:35
  • @mgilson cjson is a fast C-based JSON library (that doesn't work on py3, as far as I know). For fast repetitive applications, it performs significantly better than json, simplejson, and ujson. – Nisan.H Apr 26 '13 at 03:25
  • @swizzard could you post a sample input and the errors it produces? – Nisan.H Apr 26 '13 at 03:33
  • @Nisan.H -- Thanks for the clarification. I wonder if there are any plans to move `cjson` into the main python branch (much the same as `pickle` vs `cPickle`). I know that the distinction goes away in py3k, with the fastest version being picked at runtime where possible -- anyway, that's neither here nor there. I suppose the next question could be "what happens if you try regular `json`?" – mgilson Apr 26 '13 at 03:33
  • @mgilson I hope cjson gets picked, but that's off topic here. Either way, I think we're stuck on this question until we get either a sample input or some explicit errors... There are a whole bunch of reasons why something would fail to deserialize from a JSON string. – Nisan.H Apr 26 '13 at 03:35
  • @Nisan.H -- Yeah, I agree. Sorry about the tangent. Hopefully OP will post a minimal example that can be used to reproduce the problem so we can fix it. – mgilson Apr 26 '13 at 03:43
  • Possible duplicate of [How to parse somewhat wrong JSON with Python?](http://stackoverflow.com/questions/1931454/how-to-parse-somewhat-wrong-json-with-python) – Paul Sweatte Jul 27 '16 at 17:35

0 Answers