I'm doing a lot of cleaning, annotating and simple transformations on very large twitter datasets (~50M messages). I'm looking for some kind of datastructure that would contain column info the way pandas does, but works with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there was something with similar functionality out there. I know I'm not the only one doing things like this!
Desired functionality:
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = datastream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)
UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.