I've collected a large number of Tweets, and I'm going to store them on a server in my lab. I'm having trouble deciding exactly how to do this, however.
For example, a Tweet has this format:
{
    "contributors": null,
    "coordinates": null,
    "created_at": "Tue Jul 10 17:09:12 +0000 2012",
    "entities": {
        "hashtags": [{
            "indices": [62, 78],
            "text": "thestrongnation"
        }],
        "urls": [],
        "user_mentions": [{
            "id": 376483630,
            "id_str": "376483630",
            "indices": [0, 8],
            "name": "SherryHonig",
            "screen_name": "sahonig"
        }]
    },
    "favorited": false,
    "geo": null,
    "id": 222739261219282945,
    "id_str": "222739261219282945",
    "in_reply_to_screen_name": "sahonig",
    "in_reply_to_status_id": 222695060528037889,
    "in_reply_to_status_id_str": "222695060528037889",
    "in_reply_to_user_id": 376483630,
    "in_reply_to_user_id_str": "376483630",
    "place": {
        "attributes": {},
        "bounding_box": {
            "coordinates": [
                [
                    [-106.645646, 25.837164000000001],
                    [-93.508038999999997, 25.837164000000001],
                    [-93.508038999999997, 36.500703999999999],
                    [-106.645646, 36.500703999999999]
                ]
            ],
            "type": "Polygon"
        },
        "country": "United States",
        "country_code": "US",
        "full_name": "Texas, US",
        "id": "e0060cda70f5f341",
        "name": "Texas",
        "place_type": "admin",
        "url": "http://api.twitter.com/1/geo/id/e0060cda70f5f341.json"
    },
    "retweet_count": 0,
    "retweeted": false,
    "source": "web",
    "text": "@sahonig BOOM !!!! I feel a 1 coming on!!! Awesome! #thestrongnation",
    "truncated": false,
    "user": {
        "contributors_enabled": false,
        "created_at": "Wed Feb 15 14:40:48 +0000 2012",
        "default_profile": false,
        "default_profile_image": false,
        "description": "Living life on 30A & doing it my way. My mind is Stronger than physical challenge. Runner, Crosfit, Fitness Challenges. Proud member of #thestrongnation. ",
        "favourites_count": 17,
        "follow_request_sent": null,
        "followers_count": 215,
        "following": null,
        "friends_count": 184,
        "geo_enabled": true,
        "id": 493181025,
        "id_str": "493181025",
        "is_translator": false,
        "lang": "en",
        "listed_count": 4,
        "location": "Seagrove Beach, FL",
        "name": "30A My Way \u2600",
        "notifications": null,
        "profile_background_color": "c0deed",
        "profile_background_image_url": "http://a0.twimg.com/profile_background_images/590670431/aj7p0c6j2oevdj240jz2.jpeg",
        "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/590670431/aj7p0c6j2oevdj240jz2.jpeg",
        "profile_background_tile": true,
        "profile_image_url": "http://a0.twimg.com/profile_images/2381704869/b7bizspexjgmyspqesg0_normal.jpeg",
        "profile_image_url_https": "https://si0.twimg.com/profile_images/2381704869/b7bizspexjgmyspqesg0_normal.jpeg",
        "profile_link_color": "0084B4",
        "profile_sidebar_border_color": "C0DEED",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "profile_use_background_image": true,
        "protected": false,
        "screen_name": "30A_MyWay",
        "show_all_inline_media": false,
        "statuses_count": 1731,
        "time_zone": "Central Time (US & Canada)",
        "url": null,
        "utc_offset": -21600,
        "verified": false
    }
}
This is JSON, which maps naturally onto a Python dictionary. MongoDB conveniently accepts documents in this form, but I don't want all of the information provided. The Streaming API gives me 20 top-level fields, when really I only want to work with the user ID, text, and location at the moment. I initially intended to parse each Tweet and extract just the fields I wanted, but I couldn't find a reliable parser, and writing one myself seems like a waste of time given the constraints this is being developed under.
Another option I'm considering: since these are going into MongoDB anyway, perhaps I could keep only the keys I want in the dictionary and discard the rest. The one complication is that the file as received from Twitter places each Tweet's entire dictionary on a single line, so I feel like I'd have to do some sort of extraction regardless.
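To make the "keep only what I want" idea concrete, here is a rough sketch of what I have in mind. It assumes each line of the file is one complete JSON object and that Python's standard `json` module can decode it. The function names `slim_tweet` and `slim_tweets` are made up for illustration, and treating "location" as the place's `full_name` (falling back to the user's profile `location`) is my own guess at which field is most useful.

```python
import json


def slim_tweet(line):
    """Decode one line of JSON and keep only user id, text, and location."""
    tweet = json.loads(line)
    # "user" and "place" can be missing or null, so fall back to empty dicts.
    user = tweet.get("user") or {}
    place = tweet.get("place") or {}
    return {
        "user_id": user.get("id"),
        "text": tweet.get("text"),
        # Assumption: prefer the tagged place, else the profile location.
        "location": place.get("full_name") or user.get("location"),
    }


def slim_tweets(lines):
    """Yield a trimmed dictionary for every non-blank line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        yield slim_tweet(line)
```

Each trimmed dictionary is plain Python data, so in principle it could be handed to MongoDB as a document instead of the full Tweet.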
Does anyone else have any suggestions?