
I am quite new to Python. I have been thinking of turning the code below into parallel calls; a list of doj values is formatted with the help of a lambda:

from datetime import datetime

def formatdoj(doj):
    doj = str(doj).split(" ")[0]
    doj = datetime.strptime(doj, '%Y' + "-" + '%m' + "-" + "%d")
    return doj

m_df[['doj']] = m_df[['doj']].apply(lambda x: formatdoj(*x), axis=1)

Since the list has a million records, formatting all of them takes a lot of time.

How can I make parallel function calls in Python, similar to Parallel.ForEach in C#?

kaarthick raman
  • Does this already answer your question? https://stackoverflow.com/questions/7207309/python-how-can-i-run-python-functions-in-parallel – Corey Taylor Jun 13 '18 at 08:32
  • Possible duplicate of [C# Parallel.Foreach equivalent in Python](https://stackoverflow.com/questions/29236642/c-sharp-parallel-foreach-equivalent-in-python) – ndrwnaguib Jun 13 '18 at 08:34

3 Answers


I think that in your case parallel computation is a bit of overkill. The slowness comes from the code itself, not from using a single processor. I'll show you in a few steps how to make it faster, guessing a bit that you're working with a Pandas dataframe and guessing what the dataframe contains (please stick to SO guidelines and include a complete working example!).

For my test, I've used the following random dataframe with 100k rows (scale times up to get to your case):

from datetime import datetime
from time import time
import numpy as np
import pandas as pd

N = int(1e5)
m_df = pd.DataFrame([['{}-{}-{}'.format(y, m, d)]
                     for y, m, d in zip(np.random.randint(2007, 2019, N),
                                        np.random.randint(1, 13, N),
                                        np.random.randint(1, 28, N))],
                    columns=['doj'])

Now this is your code:

tstart = time()
m_df[['doj']] = m_df[['doj']].apply(lambda x: formatdoj(*x), axis=1)
print("Done in {:.3f}s".format(time()-tstart))

On my machine it runs in around 5.1s. It has several problems. The first is that you're working on a DataFrame instead of a Series even though you only use one column, and that you're creating a useless lambda function. Simply doing:

m_df['doj'] = m_df['doj'].apply(formatdoj)

Cuts the time down to 1.6s. Also, joining strings with '+' is slow in Python, so you can change your formatdoj to:

def faster_formatdoj(doj):
    return datetime.strptime(doj.split()[0], '%Y-%m-%d')
m_df['doj'] = m_df['doj'].apply(faster_formatdoj)

This is not a huge improvement, but it does cut the time down a bit, to 1.5s. If you really need to join the strings (e.g. because they are not fixed), use '-'.join(['%Y', '%m', '%d']) instead; it's faster than repeated '+'.
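
For example, a tiny illustration of that join variant (the parts list here is hypothetical, just to show the idea):

from datetime import datetime

parts = ['%Y', '%m', '%d']   # hypothetical, non-fixed format pieces
fmt = '-'.join(parts)        # builds '%Y-%m-%d' without repeated '+'
print(datetime.strptime('2018-06-13', fmt))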

But the true bottleneck comes from calling datetime.strptime so many times. It is an intrinsically slow call, since parsing dates is expensive. On the other hand, if you have millions of dates, and assuming they're not uniformly spread since the beginning of humankind, chances are they are massively duplicated. So the following is how you should really do it:

tstart = time()
# Create a new column with only the first word
m_df['doj_split'] = m_df['doj'].apply(lambda x: x.split()[0])
converter = {
    x: faster_formatdoj(x) for x in m_df['doj_split'].unique()
}
m_df['doj'] = m_df['doj_split'].apply(lambda x: converter[x])
# Drop the column we added
m_df.drop(['doj_split'], axis=1, inplace=True)
print("Done in {:.3f}s".format(time()-tstart))

This works in around 0.2/0.3s, more than 10 times faster than your original implementation.

After all this, if you are still running too slow, you can consider working in parallel (parallelizing the first "split" instruction separately and, maybe, the apply-lambda part; otherwise you'd be creating many different "converter" dictionaries and nullifying the gain), as in the sketch below. But I'd take that as the last step rather than the first solution...
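
If you do go that way, here is a rough sketch of what I mean (purely illustrative: the process count of 4 and the split_first_word helper are placeholders, and it reuses m_df and faster_formatdoj from above). Only the split step is parallelized, and the converter dictionary is still built once:

import multiprocessing as mp
import numpy as np

def split_first_word(values):
    # Runs in a worker process: keep only the first word of each string
    return [x.split()[0] for x in values]

if __name__ == '__main__':
    n_proc = 4  # placeholder, tune to your machine
    chunks = np.array_split(m_df['doj'].values, n_proc)
    with mp.Pool(processes=n_proc) as pool:
        split_parts = pool.map(split_first_word, chunks)
    m_df['doj_split'] = np.concatenate(split_parts)
    # The converter is still built only once, so the gain is not nullified
    converter = {x: faster_formatdoj(x) for x in m_df['doj_split'].unique()}
    m_df['doj'] = m_df['doj_split'].map(converter)
    m_df.drop(['doj_split'], axis=1, inplace=True)

Whether this beats the single-process version depends on how much pickling the string chunks back and forth costs, so it's worth timing both.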

[EDIT]: Originally in the first step of the last code box I used m_df['doj_split'] = m_df['doj'].str.split().apply(lambda x: x[0]) which is functionally equivalent but a bit slower than m_df['doj_split'] = m_df['doj'].apply(lambda x: x.split()[0]). I'm not entirely sure why, probably because it's essentially applying two functions instead of one.

Marco Spinaci

Your best bet is to use Dask. Dask has a DataFrame type which you can use to create a similar dataframe, and while executing the compute function you can specify the number of workers with the num_workers argument. This will parallelize the task.
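
For example, a minimal sketch of that idea (assuming the m_df dataframe and the date format from the question; the partition count, scheduler and worker count here are arbitrary placeholders):

import dask.dataframe as dd
from datetime import datetime

def formatdoj(doj):
    # Same conversion as in the question, applied per element
    return datetime.strptime(str(doj).split(" ")[0], '%Y-%m-%d')

ddf = dd.from_pandas(m_df, npartitions=8)   # 8 partitions is arbitrary
ddf['doj'] = ddf['doj'].apply(formatdoj, meta=('doj', 'datetime64[ns]'))
m_df = ddf.compute(scheduler='processes', num_workers=4)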

rawwar

Since I'm not sure about your example, I will give you another one using the multiprocessing library:

# -*- coding: utf-8 -*-
import multiprocessing as mp

input_list = ["str1", "str2", "str3", "str4"]

def format_str(str_input):
    str_output = str_input + "_test"
    return str_output

if __name__ == '__main__':
    with mp.Pool(processes = 2) as p:
        result = p.map(format_str, input_list)

    print (result)

Now, let's say you want to map a function with several arguments, you should then use starmap():

# -*- coding: utf-8 -*-
import multiprocessing as mp

input_list = ["str1", "str2", "str3", "str4"]

def format_str(str_input, i):
    str_output = str_input + "_test" + str(i)
    return str_output

if __name__ == '__main__':
    with mp.Pool(processes = 2) as p:
        result = p.starmap(format_str, [(s, i) for i, s in enumerate(input_list)])

    print (result)

Do not forget to place the Pool inside the if __name__ == '__main__': guard, and note that multiprocessing will not work inside an IDE such as Spyder (or others), so you'll need to run the script from the cmd.

To keep the results, you can either save them to a file, or keep the cmd window open at the end with os.system("pause") on Windows or input() on Linux.

It's a fairly simple way to use multiprocessing with python.
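
Applied to the question's dataframe, a minimal sketch could look like this (assuming m_df and the question's date format; the process count is arbitrary):

import multiprocessing as mp
from datetime import datetime

def formatdoj(doj):
    # Same conversion as in the question
    return datetime.strptime(str(doj).split(" ")[0], '%Y-%m-%d')

if __name__ == '__main__':
    with mp.Pool(processes=4) as p:
        m_df['doj'] = p.map(formatdoj, m_df['doj'])

Keep in mind that sending millions of strings to worker processes and back has its own pickling overhead, so it's worth measuring whether it actually helps.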

Mathieu