
I have code like:

import pandas as pd
import multiprocessing as mp

a = {'a': [1, 2, 3, 1, 2, 3],
     'b': [5, 6, 7, 4, 6, 5],
     'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry']}
df = pd.DataFrame(a)

def performDBSCAN(feature):
    value = scorecalculate(feature)
    print(value)
    for ele in range(4):
        value = value + 1
        print('here value is ', value)
    return value

def processing(feature):
    result1 = performDBSCAN(feature)
    return result1

def scorecalculate(feature):
    scorecal = 0
    for val in ['a', 'b', 'c', 'd']:
        print('alpha is:', val)
        scorecal = scorecal + 1
    return scorecal

columns = df.columns
for ele in df.columns:
    processing(ele)

The above code executes serially. I would like to make it faster by processing each column in parallel using Python, so I wrote the following code with multiprocessing, but it didn't help.

import pandas as pd
import multiprocessing as mp

def performDBSCAN(feature):
    value = scorecalculate(feature)
    print(value)
    for ele in range(4):
        value = value + 1
        print('here value is ', value)
    return value

def scorecalculate(feature):
    scorecal = 0
    for val in ['a', 'b', 'c', 'd']:
        print('alpha is:', val)
        scorecal = scorecal + 1
    return scorecal

def processing(feature):
    result1 = performDBSCAN(feature)
    return result1

a = {'a': [1, 2, 3, 1, 2, 3],
     'b': [5, 6, 7, 4, 6, 5],
     'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry']}
df = pd.DataFrame(a)
columns = df.columns
pool = mp.Pool(4)
resultpool = pool.map(processing, columns)

I can't see any output and the kernel keeps running without producing anything. What could be the issue? Is there another way of doing this with other libraries, such as numba? (Note: this code is just a simplified example. The basic idea is that I have to take each column of a dataframe and run the DBSCAN algorithm on it; based on the result of DBSCAN, I have another function that calculates a score. I gave these two functions in the code above. The incrementing operations in those functions are only there to verify that execution actually reaches them. In the first part of the code the columns are processed serially, whereas I need to parallelise that for loop so that I can process multiple columns in parallel.)

    I am completely stumped about what this code is supposed to be doing. – Jan Christoph Terasa Sep 14 '18 at 06:29
  • This code is an abstract example. The basic idea is that I have to take each column of a dataframe and run the DBSCAN algorithm on it; based on the result of DBSCAN, I have another function that calculates a score. That was my intention. In the first part of the code the columns are processed serially, whereas I need to parallelise that for loop so that I can process multiple columns in parallel. – Vas Sep 14 '18 at 06:38
  • You could try using the multiprocessing module: https://docs.python.org/3.4/library/multiprocessing.html. I have only used the thread pool from multiprocessing.pool, but it parallelized my application quite effectively. I am in no way an expert on this; just sharing what I found useful in my own project. – Ninad Gaikwad Sep 14 '18 at 06:41
  • Can you post an example which represents your actual problem? It completely depends on the problem whether this can be sped up by vectorization. Your example is not using the dataframe at all, it just prints some stuff on the screen. Please post an [MCVE](https://stackoverflow.com/help/mcve). – Jan Christoph Terasa Sep 14 '18 at 06:41
  • I used only vectorized methods in DBSCAN and in the scoring function, but I need to optimize this for loop by executing it in parallel. – Vas Sep 14 '18 at 06:44
  • Those two functions contain a huge number of lines of code; that is why I am not able to reproduce them here exactly. – Vas Sep 14 '18 at 06:51
  • I am still puzzled by the main idea. How can numbers and strings have a useful common operation which can be parallelized? But maybe [`pandas.DataFrame.applymap`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.applymap.html) is the thing you're looking for? – Jan Christoph Terasa Sep 14 '18 at 06:56

1 Answer


You have to use an if __name__ == '__main__': guard, as stated in the programming guidelines for the multiprocessing module: https://docs.python.org/3/library/multiprocessing.html#multiprocessing-programming. That is, the second piece of code you provided should look like this:

# imports

# functions

if __name__ == '__main__':  # prevents the pool setup from re-running when child processes import this module
    a = {
        'a': [1, 2, 3, 1, 2, 3],
        'b': [5, 6, 7, 4, 6, 5],
        'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry']}
    df = pd.DataFrame(a)
    pool = mp.Pool(4)
    result = pool.map(processing, df.columns)
    print(result)

Output:

[8, 8, 8]
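
As a side note (a small variation, not required for the fix itself), multiprocessing.Pool can also be used as a context manager, which takes care of shutting the worker processes down once the map is finished:

if __name__ == '__main__':
    a = {'a': [1, 2, 3, 1, 2, 3],
         'b': [5, 6, 7, 4, 6, 5],
         'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry']}
    df = pd.DataFrame(a)
    with mp.Pool(4) as pool:  # workers are cleaned up automatically on exit
        result = pool.map(processing, df.columns)
    print(result)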

EDIT:

To run the code in a Jupyter Notebook you have to place your functions into a module (in the simplest case, a .py file in the folder where your .ipynb notebook is located) and then import that module, of course. This fixes the problem for me.
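
For example, a minimal sketch; the module name workers.py and the trivial function bodies are just placeholders standing in for the real DBSCAN and scoring code:

# workers.py -- saved in the same folder as the notebook
def scorecalculate(feature):
    # stand-in for the real score calculation (returns 4 like the example above)
    return 4

def performDBSCAN(feature):
    # stand-in for the real DBSCAN call plus the incrementing loop
    return scorecalculate(feature) + 4

def processing(feature):
    return performDBSCAN(feature)

And then, in a notebook cell:

# notebook cell
import multiprocessing as mp
import pandas as pd
from workers import processing  # worker functions must come from an importable module

a = {'a': [1, 2, 3, 1, 2, 3],
     'b': [5, 6, 7, 4, 6, 5],
     'c': ['dog', 'cat', 'tree', 'slow', 'fast', 'hurry']}
df = pd.DataFrame(a)

pool = mp.Pool(4)
result = pool.map(processing, df.columns)
pool.close()
pool.join()
print(result)  # [8, 8, 8] with the dummy functions above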

  • @Vamshi Yes. I ran your code and got infinite running time with no result. Then I made this correction and got the output `[8, 8, 8]` without infinite execution. I use Python 3.6 / Win10. –  Sep 14 '18 at 16:53
  • @Vamshi Just tried again on another machine (python 3.6 / win7). The result is the same - `if __name__ == '__main__':` fixes the problem. –  Sep 14 '18 at 17:34
  • I tried it, but it is not working on my system (Windows 10, Python 3.6). I just want to see whether it reaches the function; the rest I can handle. – Vas Sep 14 '18 at 17:35
  • I think there is a problem with the OS. I found something here: https://stackoverflow.com/questions/32296037/python-multiprocessing-on-windows-10 – Vas Sep 14 '18 at 17:38
  • Did you run it in a Python notebook or something else? – Vas Sep 14 '18 at 17:40
  • @Vamshi I use Anaconda and the Windows Command Prompt. I tried the code in Jupyter Notebook and got no result with infinite running time. So it is probably a Jupyter-related issue where the processes do not know how they are related to the main process. –  Sep 14 '18 at 17:47
  • Yeah @Poolka, I have been trying it in Jupyter for the last 48 hours. That could be the issue. I will try it in another IDE or in a terminal. – Vas Sep 14 '18 at 17:49
  • @Vamshi I have edited my answer. Please check it out to solve the problem in Jupyter Notebook. –  Sep 14 '18 at 18:01
  • I found this, which states that the problem comes from Jupyter: https://stackoverflow.com/questions/47313732/jupyter-notebook-never-finishes-processing-using-multiprocessing-python-3/47374811#47374811 – Vas Sep 14 '18 at 18:01
  • @Poolka Can you help me with this one? https://stackoverflow.com/questions/52413152/how-can-i-make-my-program-to-use-multiple-cores-of-my-system-in-python?noredirect=1 – Vas Sep 19 '18 at 22:19