
At present, I've got a lot of Tweets, and I'm going to store them on a server in my lab. I'm having a bit of an issue determining just how I intend to do this, however.

For example, a Tweet has this format:

{
    "contributors": null,
    "coordinates": null,
    "created_at": "Tue Jul 10 17:09:12 +0000 2012",
    "entities": {
        "hashtags": [{
            "indices": [62, 78],
            "text": "thestrongnation"
        }],
        "urls": [],
        "user_mentions": [{
            "id": 376483630,
            "id_str": "376483630",
            "indices": [0, 8],
            "name": "SherryHonig",
            "screen_name": "sahonig"
        }]
    },
    "favorited": false,
    "geo": null,
    "id": 222739261219282945,
    "id_str": "222739261219282945",
    "in_reply_to_screen_name": "sahonig",
    "in_reply_to_status_id": 222695060528037889,
    "in_reply_to_status_id_str": "222695060528037889",
    "in_reply_to_user_id": 376483630,
    "in_reply_to_user_id_str": "376483630",
    "place": {
        "attributes": {},
        "bounding_box": {
            "coordinates": [
                [
                    [-106.645646, 25.837164000000001],
                    [-93.508038999999997, 25.837164000000001],
                    [-93.508038999999997, 36.500703999999999],
                    [-106.645646, 36.500703999999999]
                ]
            ],
            "type": "Polygon"
        },
        "country": "United States",
        "country_code": "US",
        "full_name": "Texas, US",
        "id": "e0060cda70f5f341",
        "name": "Texas",
        "place_type": "admin",
        "url": "http://api.twitter.com/1/geo/id/e0060cda70f5f341.json"
    },
    "retweet_count": 0,
    "retweeted": false,
    "source": "web",
    "text": "@sahonig BOOM !!!! I feel a 1 coming on!!! Awesome! #thestrongnation",
    "truncated": false,
    "user": {
        "contributors_enabled": false,
        "created_at": "Wed Feb 15 14:40:48 +0000 2012",
        "default_profile": false,
        "default_profile_image": false,
        "description": "Living life on 30A & doing it my way. My mind is Stronger than physical challenge. Runner, Crosfit, Fitness Challenges. Proud member of #thestrongnation. ",
        "favourites_count": 17,
        "follow_request_sent": null,
        "followers_count": 215,
        "following": null,
        "friends_count": 184,
        "geo_enabled": true,
        "id": 493181025,
        "id_str": "493181025",
        "is_translator": false,
        "lang": "en",
        "listed_count": 4,
        "location": "Seagrove Beach, FL",
        "name": "30A My Way \u2600",
        "notifications": null,
        "profile_background_color": "c0deed",
        "profile_background_image_url": "http://a0.twimg.com/profile_background_images/590670431/aj7p0c6j2oevdj240jz2.jpeg",
        "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/590670431/aj7p0c6j2oevdj240jz2.jpeg",
        "profile_background_tile": true,
        "profile_image_url": "http://a0.twimg.com/profile_images/2381704869/b7bizspexjgmyspqesg0_normal.jpeg",
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2381704869/b7bizspexjgmyspqesg0_normal.jpeg",
        "profile_link_color": "0084B4",
        "profile_sidebar_border_color": "C0DEED",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "profile_use_background_image": true,
        "protected": false,
        "screen_name": "30A_MyWay",
        "show_all_inline_media": false,
        "statuses_count": 1731,
        "time_zone": "Central Time (US & Canada)",
        "url": null,
        "utc_offset": -21600,
        "verified": false
    }
}

Once loaded, this is just a Python dictionary, and the raw form is valid JSON. MongoDB conveniently accepts documents in this format, but I don't want all of the information provided. The Streaming API gives me 20 fields, when at the moment I only care about the user ID, the text, and the location. I initially intended to parse the data and extract just the fields I wanted, but I couldn't find a reliable parser, and writing one myself feels like a waste of time given the conditions under which this is being developed.

Another solution I'm considering: since these are being read into MongoDB anyway, perhaps I could keep only the keys I want from the dictionary and discard the rest. The only issue is that the file, as received from Twitter, places each dictionary entirely on a single line, so I feel like I'd have to do some sort of extraction regardless.

Does anyone else have any suggestions?

Jon Clements
Noc
    The example code for Python with pymongo at http://stackoverflow.com/questions/10855518/optimization-dumping-json-from-a-streaming-api-to-mongo/10865813#10865813 should help a lot. – Asya Kamsky, Jul 11 '12 at 22:43

1 Answer


If you have to, you can use json.loads to turn the raw result into a Python structure so it can be manipulated, if it isn't one already. For a single tweet like the one above it returns a dict; for a JSON array of tweets it returns a list of such dicts. (Normally, though, one would use a Python Twitter library that does this transparently.)
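For the one-tweet-per-line file described in the question, a minimal sketch of the json.loads step could look like the following. The helper name parse_tweets and the sample lines are made up for illustration, and skipping blank lines is an assumption about what the saved stream may contain.

```python
import json


def parse_tweets(lines):
    """Yield one dict per non-empty line of line-delimited JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: the saved stream may contain blank lines
        yield json.loads(line)


# Hypothetical sample standing in for lines read from the saved file.
sample = [
    '{"id": 1, "text": "first", "user": {"id": 10, "screen_name": "a"}}',
    '',
    '{"id": 2, "text": "second", "user": {"id": 20, "screen_name": "b"}}',
]
tweets = list(parse_tweets(sample))
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines one at a time.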

Just create a new dict containing only the data you want and insert that into MongoDB, e.g.:

# Assuming ret is a tweet response as above, already parsed into a dict

mydata = {
    'name': ret['user']['screen_name'],
    'text': ret['text']
}

print(mydata['name'], 'wrote', mydata['text'])  # or something (Python 3 print function)

# insert mydata into appropriate MongoDB DB/collection here
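
The same idea extends to the user ID, text, and location fields mentioned in the question. The sketch below is one possible way to do it, not a definitive implementation: slim_tweet and store_tweets are hypothetical helper names, the output key names are arbitrary, and the collection argument is assumed to be a pymongo collection (pymongo 3.0 or later provides insert_one).

```python
def slim_tweet(tweet):
    """Return only the fields of interest from a parsed tweet dict."""
    user = tweet.get('user') or {}
    place = tweet.get('place') or {}  # "place" is null for many tweets
    return {
        'user_id': user.get('id'),
        'text': tweet.get('text'),
        'user_location': user.get('location'),  # free-text profile location
        'place': place.get('full_name'),        # tagged place, may be None
    }


def store_tweets(collection, tweets):
    """Insert the slimmed form of each tweet; return how many were inserted."""
    count = 0
    for tweet in tweets:
        # Assumes collection behaves like a pymongo collection.
        collection.insert_one(slim_tweet(tweet))
        count += 1
    return count


# Hypothetical, cut-down tweet using values from the example in the question.
example = {
    'text': 'hello',
    'place': None,
    'user': {'id': 493181025, 'location': 'Seagrove Beach, FL'},
}
doc = slim_tweet(example)
```

Using .get with a fallback keeps the sketch from raising KeyError on tweets that lack a field, at the cost of storing None for the missing values.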
Jon Clements