
I'm doing a lot of cleaning, annotating, and simple transformations on very large Twitter datasets (~50M messages). I'm looking for some kind of data structure that contains column info the way pandas does, but works with iterators rather than reading the whole dataset into memory at once. I'm considering writing my own, but I wondered if there was something with similar functionality out there. I know I'm not the only one doing things like this!

Desired functionality:

>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds.columns
['id', 'message']
>>> ds.iterator.next()
[2385, "Hi it's me, Sally!"]
>>> ds = DataStream.read_sql("SELECT id, message from dataTable WHERE epoch < 129845")
>>> ds_tok = get_tokens(ds)
>>> ds_tok.columns
['message_id', 'token', 'n']
>>> ds_tok.iterator.next()
[2385, "Hi", 0]
>>> ds_tok.iterator.next()
[2385, "it's", 1]
>>> ds_tok.iterator.next()
[2385, "me", 2]
>>> ds_tok.to_sql(db_info)

UPDATE: I've settled on a combination of dict iterators and pandas dataframes to satisfy these needs.
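
Roughly what that ended up looking like (a minimal sketch only; the iter_records name, the sqlite3 connection, and the chunk size are placeholders, not any real library):

import sqlite3

import pandas as pd

def iter_records(query, conn, chunksize=50000):
    # Yield one dict per row, reading the query result in chunks so the
    # whole dataset never has to fit in memory at once.
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        # each chunk is an ordinary DataFrame, so column info is preserved
        for record in chunk.to_dict(orient="records"):
            yield record

conn = sqlite3.connect("tweets.db")  # placeholder database
for row in iter_records("SELECT id, message FROM dataTable", conn):
    tokens = row["message"].split()  # stand-in for the real tokenization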

Selah
    fwiw, you don't have to read it all into memory at once. you can use chunk_size, e.g. did you see this? http://stackoverflow.com/questions/11622652/large-persistent-dataframe-in-pandas – szeitlin May 01 '15 at 20:32
  • Thanks for the link! I don't see any option in read_sql that gives me an iterator :/ http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.sql.read_sql.html – Selah May 01 '15 at 20:48
  • read_sql has chunksize too: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.sql.read_sql.html?highlight=chunksize – TomAugspurger May 01 '15 at 21:07
  • Good point. Thanks. – Selah May 02 '15 at 13:24

1 Answer


As commented, there is a chunksize argument for read_sql, which means you can work on the SQL results piecemeal. I would probably use HDFStore to save the intermediate results... or you could just append it back to another SQL table.

import pandas as pd

dfs = pd.read_sql(..., chunksize=100000)  # iterator of DataFrames rather than one big frame
store = pd.HDFStore("store.h5")
for df in dfs:
    clean_df = ...  # whatever munging you have to do
    store.append("df", clean_df)
store.close()

(see the HDF5 section of the docs), or

dfs = pd.read_sql(..., chunksize=100000)
for df in dfs:
    clean_df = ...  # same munging as above
    clean_df.to_sql(..., if_exists='append')  # append each cleaned chunk to the target table

See the SQL section of the docs.

Andy Hayden
  • So would it be inefficient to read things at chunksize=1 with pandas in practice? – Selah May 02 '15 at 13:23
  • @Selah That would be terrible. Chunksize needs to be pretty big; IIRC a minimum of 50000 is a good rule of thumb. But try a few different values and compare (with %timeit). – Andy Hayden May 02 '15 at 17:17
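
For reference, one rough way to run that comparison (a sketch only; the sqlite3 connection, the query, and the candidate sizes are placeholders, and pass stands in for the real cleaning step):

import sqlite3
import time

import pandas as pd

conn = sqlite3.connect("tweets.db")          # placeholder connection
query = "SELECT id, message FROM dataTable"  # placeholder query

for chunksize in (1000, 50000, 200000):      # candidate sizes to compare
    start = time.perf_counter()
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        pass  # substitute the real cleaning/munging step here
    print("chunksize={}: {:.1f}s".format(chunksize, time.perf_counter() - start))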