
Sorry, a bit of a newbie with Python.

Can anyone help with the below code? I'm trying to write two dataframes, created by two separate multiprocessing Processes to the same excel file.

EDIT: this is simplified code. In my actual project the dataframes are built using pd.read_sql() on different connections. If this won't bring about any noticeable difference in speed, please let me know. I just assumed running it normally would mean waiting for the first connection's SQL query to finish before starting the second connection's.

import pyodbc
import pandas as pd
import os
from datetime import datetime
import multiprocessing

def Test1():
    global df
    df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})

def Test2():
    global df2
    df2 = pd.DataFrame({'Data': [20, 40, 60, 40, 30, 60, 90]})

if __name__ == '__main__':
    Proc1 = multiprocessing.Process(target=Test1)
    Proc2 = multiprocessing.Process(target=Test2)
    Proc1.start()
    Proc2.start()
    Proc1.join()
    Proc2.join()
    writer = pd.ExcelWriter(os.path.join(os.environ['USERPROFILE'], 'Desktop', 'Test.xlsx'), engine='xlsxwriter')
    df.to_excel(writer, sheet_name='Test Title', index=False)
    df2.to_excel(writer, sheet_name='Test Title2', index=False)
    workbook = writer.book
    worksheet = writer.sheets['Test Title']
    worksheet = writer.sheets['Test Title2']
    writer.save()

It doesn't help that I don't know the right terminology to search for the answer. So apologies if this is a duplicate of a question asked by someone more Python-literate than myself.

Also, the error message:

line 37, in <module>
df.to_excel(writer, sheet_name='Test Title',index=False)
NameError: name 'df' is not defined
Andy
  • I can't really see the use-case of multiprocessing here. It's probably slower than serially creating the DFs – roganjosh Dec 29 '18 at 14:48
  • Please check where your bottleneck is. Last time I worked with excel in python the problem was with the library/package and there was no need for multiprocessing. Just for a new package. – not_a_bot_no_really_82353 Dec 29 '18 at 15:07
  • 1
    Also if you have two processes for one file they might block each other. – not_a_bot_no_really_82353 Dec 29 '18 at 15:07
  • @roganjosh, sorry, this is a trimmed down version of the code I'm actually using. Each of the dataframes is actually created by an SQL query that takes a while to run, so I'm trying to run them concurrently. – Andy Dec 29 '18 at 16:01

1 Answer


Using global with multiprocessing does not work the way you seem to think it does.

Both Test1 and Test2 each run in a separate process that shares no data with the parent process, so the df and df2 they create never exist in the parent. That is why you get the NameError.

Some things that you could do:

  • Create a multiprocessing.Queue before starting the processes. The processes can then use that Queue to send data back to the parent process. The data has to be picklable (a DataFrame is). The parent then writes the data to the excel file; see the first sketch below.
  • Have each process write its own dataframe to a file and then signal the parent that it is done. The parent merges the files into an excel file when all processes are finished; see the second sketch below. A variant of this is to use memory-mapped files (mmap), which works well as long as the data does not exceed available memory.
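For example, a minimal sketch of the Queue approach could look like this. The worker function here is a placeholder; in your case it would run the pd.read_sql() query on its own connection, and the output file name Test.xlsx is just illustrative:

import multiprocessing
import pandas as pd

def worker(name, queue):
    # Stand-in for the real pd.read_sql() call on a separate connection.
    df = pd.DataFrame({'Data': [10, 20, 30, 20, 15, 30, 45]})
    # DataFrames are picklable, so they can travel through the Queue.
    queue.put((name, df))

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    procs = [
        multiprocessing.Process(target=worker, args=('Test Title', queue)),
        multiprocessing.Process(target=worker, args=('Test Title2', queue)),
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining; a child blocks on put() if its data
    # is still sitting in the pipe, so get() first, join() after.
    frames = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    with pd.ExcelWriter('Test.xlsx', engine='xlsxwriter') as writer:
        for sheet, df in frames.items():
            df.to_excel(writer, sheet_name=sheet, index=False)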

Which approach works best depends on a lot of factors. For small amounts of data using a Queue is a good option. For large amounts of data using intermediate files (especially on an SSD) can be a good option.
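The intermediate-file variant might look roughly like this, using pickle files as the hand-off format (the file names df1.pkl and df2.pkl are made up for the example):

import multiprocessing
import pandas as pd

def worker(path):
    # Stand-in for the real query; the child writes its result to disk.
    df = pd.DataFrame({'Data': [20, 40, 60, 40, 30, 60, 90]})
    df.to_pickle(path)  # pickle preserves dtypes, unlike CSV

if __name__ == '__main__':
    jobs = {'Test Title': 'df1.pkl', 'Test Title2': 'df2.pkl'}
    procs = [multiprocessing.Process(target=worker, args=(path,))
             for path in jobs.values()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()  # join() doubles as the "child is done" signal here
    with pd.ExcelWriter('Test.xlsx', engine='xlsxwriter') as writer:
        for sheet, path in jobs.items():
            pd.read_pickle(path).to_excel(writer, sheet_name=sheet, index=False)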

Roland Smith