
Overview:
I am trying to create a large number of variables (a wide dataset) for a financial data analysis project. I have a pandas DataFrame "position_history" with 140,000 rows, each containing a stock together with its buy/sell dates and prices.

I have a function create_domain that takes inputs (stock, buy_date, sell_date) and:

  1. queries my SQLite3 database to extract the time series for that stock between those dates,
  2. constructs my variables from that time series.

I apply the function create_domain to position_history using df.apply.
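
For context, the query-and-variable-construction step inside create_domain looks roughly like this (heavily simplified: the table/column names prices, ticker, and date are placeholders, and the real function takes its arguments a bit differently):

import sqlite3
import pandas as pd

conn = sqlite3.connect("my_database.db")   # placeholder path

def sql_query_and_create_vars(conn, stock, buy_date, sell_date):
    # pull the time series for one stock between its buy and sell dates
    ts = pd.read_sql_query(
        "SELECT * FROM prices WHERE ticker = ? AND date BETWEEN ? AND ?",
        conn, params=(stock, buy_date, sell_date))
    # ... construct the variables (returns, rolling stats, etc.) from ts ...
    return pd.Series({"n_days": len(ts)})   # placeholder output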

When I run my code sequentially it takes about 4 hours to construct my variables, and I would like to speed this up using multiple processes, since I will have to do this many times and may need a much wider dataset.

To run multiple processes, I split position_history row-wise into chunks, creating a list of DataFrames, and pass that list to joblib (multiprocessing). My code almost always hangs indefinitely without throwing any errors (though it sometimes completes on small samples).

My suspicion has been that the problem arises from my worker processes trying to read the same SQLite table concurrently.

I have tried the following remedies:

Related stackoverflow entries:

(Pseudo) code snippets
My call to joblib:

from joblib import Parallel, delayed

x = Parallel(n_jobs=4)(delayed(create_domain)(chunk, other_inputs) for chunk in chunks)
# where each chunk is a portion of the position_history df and other_inputs stands in for the remaining arguments
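
The chunks themselves are just a row-wise split of position_history; a minimal sketch, assuming np.array_split (one of several ways to do the split):

import numpy as np

# split position_history row-wise into a list of smaller DataFrames
chunks = np.array_split(position_history, 4)   # the chunk count here is illustrative

The list x returned by Parallel can then be recombined with pd.concat(x).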

My create_domain function:

def create_domain(df, inputs):
    # build the variables for each row of this chunk of position_history
    f = lambda x: sql_query_and_create_vars(inputs, x['column'])
    result = df.apply(f, axis=1)   # one SQLite query per row
    return result

Summary: My code hangs forever or crashes the kernel, without raising any errors. I would appreciate any insight as to:

  • Why is this happening?
  • How can I fix it?
  • Is there simply a better way to do what I'm trying to do? I did my best to vectorize and optimize my SQL queries.
  • Is this a problem with SQLite3? Would something like MySQL work better?
  • Any tips? I'm new to coding/Python/data science.

Details: I am on a supercomputer cluster running Linux, using IPython on Python 3.4.3.

This is my first Stack Overflow question -- sorry in advance for any faux pas, and thanks for your help!

  • It sounds like your app is severely IO-bound. Having multiple processes in this case would likely not help. It sounds like you're attempting to execute 140k queries, which *of course* is going to take a *really* long time. The way to fix this is to batch your queries. Instead of executing a query for each of the 140k, see if you can execute one query per chunk. – univerio Aug 16 '16 at 21:06
  • Hi univerio, thanks for your comment. Yes, most of the time is spent on the SQL queries according to the light profiling I've done. I could reduce the number of queries, and do more of the processing on the pandas side (in memory). For example, I could make one query to pull the time-series for many stocks, and then split those into individual stocks once I've got them in the pandas df. Is pandas generally faster than SQL for this? – waywocket Aug 16 '16 at 21:26
  • Yes, for the simple reason that you're using memory instead of disk. You only need to use the disk if you don't have enough memory. – univerio Aug 16 '16 at 22:21
  • Ok, I'll try this: I'll pull the whole stocks table into a dataframe, then I'll put that dataframe into a SQLite database that exists only in memory (https://www.sqlite.org/inmemorydb.html). Then I should be able to use my existing SQL code and gain the advantage of having everything in mem; and maybe multiprocessing will work too. Thanks again. – waywocket Aug 17 '16 at 00:57
  • If all DB accesses are read only, try using the [`unix-none` VFS](http://www.sqlite.org/vfs.html) to remove all locking overhead. – CL. Aug 17 '16 at 07:00
  • Update -- I took univerio's advice and tried a different approach that ended up working out (though I did not solve the SQL multiprocessing problem). I did try to use an in-memory :memory: SQLite DB, which sped things up by about 15% but still couldn't handle multiprocessing the way I had it set-up. Then, I moved over to pandas: I put the entire stock table into a pandas dataframe and "queried" that instead, and that worked with multiprocessing. Thanks for the tips. – waywocket Aug 19 '16 at 20:06
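
For reference, a minimal sketch of the approach described in the update comment above: read the whole stock table into pandas once, then filter that in-memory DataFrame per row instead of issuing one SQLite query per row (table/column names are illustrative placeholders):

import sqlite3
import pandas as pd

# one big read instead of one query per position
conn = sqlite3.connect("my_database.db")          # placeholder path
all_prices = pd.read_sql_query("SELECT * FROM prices", conn)
conn.close()

def create_vars(row):
    # "query" the in-memory DataFrame instead of hitting the database
    ts = all_prices[(all_prices["ticker"] == row["stock"])
                    & (all_prices["date"] >= row["buy_date"])
                    & (all_prices["date"] <= row["sell_date"])]
    return pd.Series({"n_days": len(ts)})         # placeholder variable(s)

new_vars = position_history.apply(create_vars, axis=1)

Per the update comment, this version also played nicely with the chunked joblib setup from the question.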

0 Answers