Using Python 2.7 on a Windows machine, I have a large pandas DataFrame (about 7 million rows and 20+ columns) from a SQL query. I'd like to loop over the unique IDs in it, filter the DataFrame down to each ID, and run calculations on the filtered rows. I'd also like to do this in parallel.
I know that if I use the standard methods from the multiprocessing package on Windows, each worker process will pickle and rebuild its own copy of that large DataFrame, and my memory will be eaten up. So, based on what I've read about remote managers, I'm trying to turn the DataFrame into a proxy object that is shared across the processes, but I'm struggling to make it work.
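For reference, this is the shape I think the manager setup should take, based on the docs. It's only a sketch, and DataHolder, filter_by_id, and MyManager are my own placeholder names:

from multiprocessing.managers import BaseManager

class DataHolder(object):
    """Holds the big DataFrame inside the manager's process; workers
    talk to it through a proxy instead of keeping their own copy."""
    def __init__(self, df):
        self.df = df

    def filter_by_id(self, index):
        # Only the filtered subset is pickled back to the caller
        return self.df[self.df['ID'] == index]

class MyManager(BaseManager):
    pass

# Register at module level so child processes see it when the module is
# re-imported (which is how multiprocessing spawns workers on Windows)
MyManager.register('DataHolder', DataHolder)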
My current code is below. It works fine in a single for loop, but again the memory gets eaten up if I run it as a parallel process:
import multiprocessing
import pandas
import pyodbc

def download(args):
    """pyodbc code to download data from the SQL database"""

def calc(dataset, index):
    filter_data = dataset[dataset['ID'] == index]
    # run calculations on the filtered DataFrame
    # append the results to a local csv

if __name__ == '__main__':
    data_1 = download(args_1)
    data_2 = download(args_2)
    all_data = data_1.append(data_2)  # Append downloaded DataFrames into one
    unique_id = pandas.unique(all_data['ID'])

    pool = multiprocessing.Pool()
    results = [pool.apply_async(calc, args=(all_data, x)) for x in unique_id]
    pool.close()   # no more tasks to submit
    pool.join()    # wait for the workers instead of exiting immediately
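And this is roughly how I'd expect to wire the manager sketch above into that pool, so that only a small proxy handle gets pickled to each worker instead of the full DataFrame. Again, calc_via_proxy and the other names are my own placeholders, and I haven't managed to get this version working:

def calc_via_proxy(holder, index):
    # The proxy pickles as a small handle, so only the filtered rows
    # come back over the connection, not the 7-million-row DataFrame
    filter_data = holder.filter_by_id(index)
    # run calculations on the filtered DataFrame
    # append the results to a local csv

if __name__ == '__main__':
    data_1 = download(args_1)
    data_2 = download(args_2)
    all_data = data_1.append(data_2)
    unique_id = pandas.unique(all_data['ID'])

    manager = MyManager()
    manager.start()
    # all_data is pickled over to the manager process once; what comes
    # back is a lightweight proxy to the DataHolder living there
    holder = manager.DataHolder(all_data)

    pool = multiprocessing.Pool()
    results = [pool.apply_async(calc_via_proxy, args=(holder, x)) for x in unique_id]
    pool.close()
    pool.join()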