I'm reading data from databases. I need to read from several servers (nodes) simultaneously, so I want to use pool.map.
I'm trying to do it this way:
import pathos.pools as pp
import pandas as pd
import urllib.parse
class DataProvider():
    def __init__(self, hosts):
        self.hosts_read = hosts

    def read_data(self, host_index):
        '''
        Read data from the current node.
        '''
        limit = 1000000
        host = self.hosts_read[host_index]
        query = f"select FIELD1 from table_name limit {limit}"
        url = urllib.parse.urlencode({'query': query})
        df = pd.read_csv(f'http://{host}:8123/?{url}',
                         sep="\t", names=['FIELD1'], low_memory=False)
        return df
    def pool_read(self, num_workers):
        '''
        Read the data using a pool of workers.
        Return a list of DataFrames - one per node.
        '''
        pool = pp.ProcessPool(num_workers)
        result = pool.map(self.read_data, range(len(self.hosts_read)))
        return result
if __name__ == '__main__':
    n_cpu = 2  # one worker per node
    provider = DataProvider(hosts=['server01.com', 'server02.com'])
    data = provider.pool_read(num_workers=n_cpu)
It works perfectly while the limit is not too big (below about 4 million rows) and crashes if it is bigger:
multiprocess.pool.MaybeEncodingError: Error sending result: '[my_pandas_dataframe]'. Reason: 'error("'i' format requires -2147483648 <= number <= 2147483647")'
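If I understand the traceback correctly, the error comes from the pool packing the byte length of the pickled result into a 4-byte signed header before sending it back to the parent process, so any result over 2 GiB overflows it. These two lines (just an illustration, not my code) reproduce the struct error itself:

import struct

struct.pack("!i", 2 ** 31)  # struct.error: 'i' format requires -2147483648 <= number <= 2147483647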
I found some answers about it: it happens because a piece of data bigger than 2 GB cannot be returned from the pool (for example: SO link). But none of them give any idea or solution for how to work when I need to load bigger parts!
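The only workaround I have come up with so far is to avoid sending the big DataFrame through the pool at all: each worker spills its result to disk and returns only the file path, and the parent reads the files back. This is a rough sketch only, reusing the pp/pd imports from the snippet above; the method name read_data_to_disk, the temp-file location, and the parquet format (which needs pyarrow or fastparquet installed) are all my own assumptions:

import os
import tempfile

# both methods below go inside class DataProvider

def read_data_to_disk(self, host_index):
    '''
    Read data from one node and spill it to disk
    instead of returning it through the pool.
    '''
    df = self.read_data(host_index)  # the existing method above
    path = os.path.join(tempfile.gettempdir(), f'node_{host_index}.parquet')
    df.to_parquet(path)  # assumption: pyarrow or fastparquet is available
    return path  # a short string, so it pickles without hitting the limit

def pool_read(self, num_workers):
    pool = pp.ProcessPool(num_workers)
    paths = pool.map(self.read_data_to_disk, range(len(self.hosts_read)))
    return [pd.read_parquet(p) for p in paths]  # reassemble in the parent

But this trades the pickling limit for extra disk I/O, so I would prefer a way to pass the data directly if one exists.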
P.S. I use the pathos module, but that is not important here - the same error occurs with the multiprocessing module too.
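For example, a minimal reproduction with the standard library fails the same way for me (at least on Python 3.7; I have read that newer versions lifted this limit, but I have not verified it):

import multiprocessing as mp

def make_big(_):
    # about 3 GiB of zero bytes - more than a signed 32-bit length can describe
    return bytes(3 * 1024 ** 3)

if __name__ == '__main__':
    with mp.Pool(1) as pool:
        pool.map(make_big, [0])  # fails with the same MaybeEncodingError / struct.error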