I have a JSON file from the Twitter API. The JSON looks like this:

{"in_reply_to_user_id_str": null, "geo": null, "id": 100689407677440000, "lang": "en", "in_reply_to_user_id": null, "contributors": null, "source": "<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>", "place": null, "user": {"profile_background_image_url": "http://pbs.twimg.com/profile_background_images/512692008/Screen_shot_2012-01-11_at_11.58.58_PM.png", "id_str": "181218735", "profile_link_color": "0084B4", "profile_image_url_https": "https://pbs.twimg.com/profile_images/378800000007522671/f4552422d443160c075fb3d521ffb3c2_normal.jpeg", "default_profile": false, "id": 181218735, "name": "Author Al King", "contributors_enabled": false, "is_translation_enabled": false, "profile_use_background_image": true, "friends_count": 109, "notifications": false, "utc_offset": -14400, "statuses_count": 3521, "listed_count": 3, "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "default_profile_image": false, "profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/512692008/Screen_shot_2012-01-11_at_11.58.58_PM.png", "profile_text_color": "333333", "description": "Order your eBook copy of the Truth-Selling LET IT BE KNOWN at http://t.co/9m4DBuSatZ", "time_zone": "Eastern Time (US & Canada)", "geo_enabled": false, "follow_request_sent": false, "profile_background_color": "C0DEED", "favourites_count": 1, "lang": "en", "verified": false, "profile_image_url": "http://pbs.twimg.com/profile_images/378800000007522671/f4552422d443160c075fb3d521ffb3c2_normal.jpeg", "followers_count": 117, "screen_name": "LetItBeKnownAKP", "url": "http://t.co/zl3wbvRpco", "is_translator": false, "profile_background_tile": true, "has_extended_profile": false, "following": false, "created_at": "Sat Aug 21 16:25:13 +0000 2010", "protected": false, "location": "", "entities": {"description": {"urls": [{"expanded_url": "http://www.thealkingpointofview.com", "display_url": "thealkingpointofview.com", "url": 
"http://t.co/9m4DBuSatZ", "indices": [62, 84]}]}, "url": {"urls": [{"expanded_url": "http://www.thealkingpointofview.com", "display_url": "thealkingpointofview.com", "url": "http://t.co/zl3wbvRpco", "indices": [0, 22]}]}}}, "truncated": false, "retweet_count": 0, "id_str": "100689407677440000", "retweeted": false, "created_at": "Mon Aug 08 22:06:40 +0000 2011", "favorited": false, "entities": {"urls": [], "hashtags": [{"text": "LETITBEKNOWNLIVERADIO", "indices": [14, 36]}], "user_mentions": [], "symbols": []}, "in_reply_to_status_id": null, "coordinates": null, "favorite_count": 0, "is_quote_status": false, "text": "WED at 8pm on #LETITBEKNOWNLIVERADIO our Discussion \"Violence among the YOUTH PT 2..... Bullying, Gang Violenc\u2026 (cont) http://deck.ly/~ZN88q", "in_reply_to_status_id_str": null, "in_reply_to_screen_name": null}

I want to extract only "id", "lang", and "text" from the JSON above, but loading the file raises an error. Here is my code:

    import json
    with open("tweet.json_test_1") as json_data:
        dataText = json.load(json_data)
        print(dataText)

The error is: ValueError: Extra data: line 1 column 2836 - line 2 column 1 (char 2835 - 2868)

Sorry if this is a repeated question; I am new to Python and ML. Thanks.

HAO CHEN
  • Thanks, that is what I thought at first: yes, the file contains many JSON objects. I tried copying just one JSON object into tweet.json_test_1 and also tried json.loads, but the error is: TypeError: the JSON object must be str, not 'TextIOWrapper' – HAO CHEN Jul 21 '15 at 11:44
  • @HAOCHEN: the code in your question does not match the error, which in turn does not match the JSON data you've shown. That makes it difficult to help you. [Create a standalone test (code + data)](http://stackoverflow.com/help/mcve) and post whatever error it produces (the full traceback). Limit your questions to one issue per question. – jfs Jul 21 '15 at 21:11
  • Thanks, my mistake: I had the wrong JSON file and have already fixed it. – HAO CHEN Jul 22 '15 at 09:50

1 Answer

It seems like your tweet.json_test_1 file contains multiple JSON objects, one per line, so you'd do better to read it line by line and parse each line as its own JSON string with json.loads. I recommend a try/except to skip any lines that aren't valid JSON. Bear in mind that if you then get no output at all, none of the lines contained valid JSON.

import json

with open ("tweet.json_test_1") as json_data:
    for line in json_data:
        try:
            dataText = json.loads(line)
        except ValueError:
            continue
        print (dataText)
        #Do other stuff here, especially if you want to retain all the JSON objects
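
Since the question asks for only "id", "lang", and "text", the same loop can collect just those keys. This is a minimal sketch, not a definitive implementation: `extract_fields` is a hypothetical helper name, and it assumes one JSON object per line, as described above.

```python
import json


def extract_fields(lines, keys=("id", "lang", "text")):
    """Parse one JSON object per line and keep only the requested keys.

    `extract_fields` is a hypothetical helper, not part of the json module.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        if not isinstance(tweet, dict):
            continue  # skip valid JSON that is not an object (e.g. a bare number)
        # Missing keys are stored as None rather than raising KeyError
        records.append({key: tweet.get(key) for key in keys})
    return records


# Usage with the file name from the question:
# with open("tweet.json_test_1", encoding="utf-8") as json_data:
#     for record in extract_fields(json_data):
#         print(record)
```

Any iterable of strings works as input, so an open file object can be passed directly.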

Incidentally, if you're the one creating the tweet.json_test_1 file, I'd recommend putting all the JSON objects in a single list (a JSON array); json.load should then work fine on the whole file.
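
As a rough illustration of the list approach, the sketch below writes a list of objects with json.dump and reads it back with a single json.load call. The sample tweets and the file name `tweets_list.json` are made up for the example; they are not from the question.

```python
import json
import os
import tempfile

# Made-up sample data standing in for real tweets
tweets = [
    {"id": 1, "lang": "en", "text": "first"},
    {"id": 2, "lang": "en", "text": "second"},
]

# Hypothetical file name, placed in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "tweets_list.json")

# Write every object inside one JSON array
with open(path, "w", encoding="utf-8") as out:
    json.dump(tweets, out)

# The whole file is now a single JSON document, so json.load can parse it
with open(path, encoding="utf-8") as json_data:
    loaded = json.load(json_data)

for tweet in loaded:
    print(tweet["id"], tweet["lang"], tweet["text"])
```

The trade-off is that the whole list is held in memory at once, whereas the line-by-line format can be processed one tweet at a time.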

SuperBiasedMan
  • Thanks, it worked and some of the JSON loaded, but now I get another error: 'charmap' codec can't encode character '\u0251' in position 511: character maps to <undefined>. I suspect it is an encoding issue, since the JSON I loaded was encoded as UTF-8. Is that the problem? – HAO CHEN Jul 21 '15 at 14:05
  • Yeah, it looks like that character is part of UTF-16. There's an answer for that error [here](http://stackoverflow.com/a/27093194/4374739). That answer uses it on soup but you can just use it like this: `line.encode('utf-8')`. – SuperBiasedMan Jul 21 '15 at 14:08
  • :( Oops, that's not working: the JSON object must be str, not 'bytes' – HAO CHEN Jul 21 '15 at 14:17
  • That shouldn't be happening, `encode` returns a string. Is this your full line of code: `dataText = json.loads(line.encode('utf-8'))` – SuperBiasedMan Jul 21 '15 at 14:22
  • with open ('tweet_2.json') as json_data: for line in json_data: try: dataText = json.loads(line.encode('utf-8')) except ValueError: continue print (dataText) – HAO CHEN Jul 21 '15 at 14:32
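
A note on the two errors in this comment thread, offered as a sketch under assumptions rather than a confirmed diagnosis. In Python 3, `str.encode` returns `bytes`, and the `json.loads` in Python versions before 3.6 accepts only `str`, which matches the "must be str, not 'bytes'" message. The "'charmap' codec can't encode" message usually comes from writing output (for example `print` to a console whose encoding lacks the character), not from parsing. Assuming that is the cause here, one option is to open the file as UTF-8, pass the decoded line straight to `json.loads`, and print an ASCII-escaped form. `load_tweets` and `to_ascii` are hypothetical helper names.

```python
import json


def load_tweets(path):
    """Yield one parsed object per valid JSON line of a UTF-8 file."""
    # Decoding is handled by open(); each line is already a str
    with open(path, encoding="utf-8") as json_data:
        for line in json_data:
            try:
                yield json.loads(line)
            except ValueError:
                continue  # skip blank or invalid lines


def to_ascii(obj):
    """Serialize obj so that non-ASCII characters become \\uXXXX escapes."""
    return json.dumps(obj, ensure_ascii=True)


# Usage with the file name from the comment above:
# for tweet in load_tweets("tweet_2.json"):
#     print(to_ascii(tweet))  # safe to print on a console with a limited encoding
```

The escaped output is only for display; the parsed objects still hold the original characters.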