3

Have a set of tweets that have been saved to a .txt file.

I want to place certain attributes in a sqlite table in Python. I successfully created the table.

import pandas
import sqlite3
conn = sqlite3.connect('twitter.db')
c = conn.cursor()

c.execute(CREATE TABLE Tweet
(
   created_at VARCHAR2(25),
   id VARCHAR2(25),
   text VARCHAR2(25)
   source VARCHAR2(25),
   in-reply_to_user_ID VARCHAR2(25), 
   retweet_Count VARCHAR2(25)

)

Before I even attempted to add the parsed data into the db, I tried to create a data frame with it just to view.

tweets =pandas.read_table('file.txt', sep=',')

I get the error:

CParserError: Error tokenizing data. C error: Expected 63 fields in line 3, saw 69

My assumption is there are ',' not only separating the fields, but within the strings too.

Also, twitter data comes in a format that I have not worked with before. Each field starts with the variable name in parenthesis, a colon, then the data separated by more parenthesis. Like:

"created_at":"Fri Oct 11 00:00:03 +0000 2013",

So how can I get this into a standard table format with the variable names at the top?

A full example of a tweet is this:

{"created_at":"Fri Oct 11 00:00:03 +0000 2013","id":388453908911095800,"id_str":"388453908911095809","text":"LAGI PUN VISITORS DATANG PUKUL 9 AH","source":"<a href=\"http://www.tweetdeck.com\" rel=\"nofollow\">TweetDeck</a>","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":447800506,"id_str":"447800506","name":"§yazwina·","screen_name":"_SAireen","location":"SSP","url":"http://flavors.me/syazwinaaireen#","description":"Absence makes the heart grow fonder. Stay us x @_DFitri's","protected":false,"followers_count":806,"friends_count":702,"listed_count":2,"created_at":"Tue Dec 27 08:29:53 +0000 2011","favourites_count":7478,"utc_offset":28800,"time_zone":"Beijing","geo_enabled":true,"verified":false,"statuses_count":32558,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"DBE9ED","profile_background_image_url":"http://a0.twimg.com/profile_background_images/378800000056283804/65d84665fbb81deba13427e8078a3eff.png","profile_background_image_url_https":"https://si0.twimg.com/profile_background_images/378800000056283804/65d84665fbb81deba13427e8078a3eff.png","profile_background_tile":true,"profile_image_url":"http://a0.twimg.com/profile_images/378800000264138431/fd9d57bd1b1609f36fd7159499a94b6e_normal.jpeg","profile_image_url_https":"https://si0.twimg.com/profile_images/378800000264138431/fd9d57bd1b1609f36fd7159499a94b6e_normal.jpeg","profile_banner_url":"https://pbs.twimg.com/profile_banners/447800506/1369969522","profile_link_color":"FA0096","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"E6F6F9","profile_text_color":"333333","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[]},"favorited":false,"retweeted":false,"filter_level":"medium","lang":"it"}
mpg
  • 3,679
  • 8
  • 36
  • 45
  • 2
    You unfortunately can't just turn nested JSON into a flat, tabular structure like a table or pandas DataFrame as they are inherently different structures. Have a look [Python's JSON library](http://docs.python.org/2/library/json.html) and [pandas' read_json method](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.json.read_json.html). You're going to need to do some munging on the twitter data to get it into a tabular format. – Greg Reda Nov 04 '13 at 02:11

1 Answers1

0

I imagine there is a python library for this already, but I was able to get your tweet string to parse as a dictionary once I replaced these terms that appear unquoted.

 false to False 
 true to True
 null to None

I just assigned the whole bracketed expression to a variable, creating a dictionary. Then you can potentially go through and print the keys as a header and each value as an entry.

Fixing or quoting those three values might also make the pandas parser happier too, although I think a csv reader might cope with all the embedded commas and single and double quotes better. The JSON parser still choked on the URL having a colon, I think. You might try escaping they if you are going to try JSON.

beroe
  • 11,784
  • 5
  • 34
  • 79