
I'm doing a lot of cleaning, annotating, and simple transformations on very large Twitter datasets (~50M messages). I'm looking for some kind of data structure that contains column info the way pandas does, but works with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there was something with similar functionality out there. I know I'm not the only one doing things like this!

Desired functionality:

>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)

UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.
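
Roughly what that ended up looking like (a minimal sketch only; the iter_records name, the sqlite3 connection, and the chunk size are placeholders, not any real library):

import sqlite3

import pandas as pd

def iter_records(query, conn, chunksize=50000):
    # Yield one dict per row, reading the query result in chunks so the
    # whole dataset never has to fit in memory at once.
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        # each chunk is an ordinary DataFrame, so column info is preserved
        for record in chunk.to_dict(orient="records"):
            yield record

conn = sqlite3.connect("tweets.db")  # placeholder database
for row in iter_records("SELECT id, message FROM dataTable", conn):
    tokens = row["message"].split()  # stand-in for the real tokenization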

Selah
    fwiw, you don't have to read it all into memory at once. you can use chunk_size, e.g. did you see this? http://stackoverflow.com/questions/11622652/large-persistent-dataframe-in-pandas – szeitlin May 01 '15 at 20:32
  • Thanks for the link! I don't see any option in read_sql that gives me an iterator :/ http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.sql.read_sql.html – Selah May 01 '15 at 20:48
  • read_sql has chunksize too: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.sql.read_sql.html?highlight=chunksize – TomAugspurger May 01 '15 at 21:07
  • Good point. Thanks. – Selah May 02 '15 at 13:24

1 Answer


As commented, there is a chunksize argument for read_sql, which means you can work on the SQL results piecemeal. I would probably use HDFStore to save the intermediate results... or you could just append it back to another SQL table.

import pandas as pd

dfs = pd.read_sql(..., chunksize=100000)  # iterator of DataFrames rather than one big frame
store = pd.HDFStore("store.h5")
for df in dfs:
    clean_df = ...  # whatever munging you have to do
    store.append("df", clean_df)
store.close()

(see the HDF5 section of the docs), or

dfs = pd.read_sql(..., chunksize=100000)
for df in dfs:
    clean_df = ...  # same munging as above
    clean_df.to_sql(..., if_exists='append')  # append each cleaned chunk to the target table

See the SQL section of the docs.

Andy Hayden
  • So would it be inefficient to read things at chunksize=1 with pandas in practice? – Selah May 02 '15 at 13:23
  • @Selah That would be terrible. Chunksize needs to be pretty big; IIRC a minimum of 50000 is a good rule of thumb. But try a few different values and compare (with %timeit). – Andy Hayden May 02 '15 at 17:17
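
For reference, one rough way to run that comparison (a sketch only; the sqlite3 connection, the query, and the candidate sizes are placeholders, and pass stands in for the real cleaning step):

import sqlite3
import time

import pandas as pd

conn = sqlite3.connect("tweets.db")          # placeholder connection
query = "SELECT id, message FROM dataTable"  # placeholder query

for chunksize in (1000, 50000, 200000):      # candidate sizes to compare
    start = time.perf_counter()
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        pass  # substitute the real cleaning/munging step here
    print("chunksize={}: {:.1f}s".format(chunksize, time.perf_counter() - start))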