
I am running the following code on about 6 million rows. It is so slow that it never finishes.

df['City'] = df['POSTAL_CODE'].apply(lambda x: nomi.query_postal_code(x).county_name)

It assigns a corresponding city to each postal code. When I run it on a slice of the dataset (e.g., 1000 rows) it works well, but running it on the whole dataset never produces any output.

Can anyone modify the code to make it faster?

Thank you!

Benn
  • 1
    Does this answer your question? [Make Pandas DataFrame apply() use all cores?](https://stackoverflow.com/questions/45545110/make-pandas-dataframe-apply-use-all-cores) – Davide Fiocco Jun 05 '20 at 15:52
  • 1
    To have some sort of visual feedback on progress, you could consider `progress_apply` with `tqdm` https://stackoverflow.com/questions/18603270/progress-indicator-during-pandas-operations – Davide Fiocco Jun 05 '20 at 15:54
  • My work computer denied installation of Swifter, so I can't use it. Thanks for your help though – Benn Jun 05 '20 at 16:36

1 Answer

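!pip3 install multiprocess

import numpy as np
import pandas as pd
from multiprocess import Pool  # fork of multiprocessing that uses dill, so lambdas can be sent to workers

def parallelize_dataframe(data, func, n_cores=4):
    # Split the data into n_cores chunks and hand each chunk to a worker process.
    data_split = np.array_split(data, n_cores)
    pool = Pool(n_cores)
    # pool.map preserves chunk order, so concat restores the original row order.
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data


# func receives a whole chunk (a Series), so the per-code lookup is applied inside each chunk.
df['City'] = parallelize_dataframe(
    df['POSTAL_CODE'],
    lambda chunk: chunk.apply(lambda x: nomi.query_postal_code(x).county_name),
    4,
)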
DejaVuSansMono
  • Can you add more info, please? In my case, the data is df and the function is nomi.query_postal_code(x). I don't know how to plug my df into your proposed solution. Thanks if you can provide some help. – Benn Jun 05 '20 at 16:39
  • @Benn Added. This should work. There are other ways to do this but this is pretty easy. – DejaVuSansMono Jun 05 '20 at 17:32
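
A different angle that needs no extra packages, offered only as a sketch: if the 6 million rows contain far fewer distinct postal codes (an assumption, not something stated in the question), each distinct code can be looked up once and the result mapped back onto every row. In the example below, `lookup_county` and the sample data are hypothetical stand-ins so the snippet runs on its own; in the real code the lookup would be `nomi.query_postal_code(x).county_name`.

```python
import pandas as pd


def lookup_county(postal_code):
    # Hypothetical stand-in for nomi.query_postal_code(postal_code).county_name,
    # so that this sketch runs without pgeocode or its data download.
    fake_table = {"10001": "New York", "60601": "Cook", "94105": "San Francisco"}
    return fake_table.get(postal_code)


# Made-up sample data with repeated postal codes.
df = pd.DataFrame(
    {"POSTAL_CODE": ["10001", "60601", "10001", "94105", "60601", "10001"]}
)

# Look up each distinct postal code once instead of once per row.
unique_codes = df["POSTAL_CODE"].dropna().unique()
code_to_county = {code: lookup_county(code) for code in unique_codes}

# Map the small lookup table back onto every row; codes missing from it become NaN.
df["City"] = df["POSTAL_CODE"].map(code_to_county)
```

The number of slow lookups then depends on the number of distinct codes rather than the number of rows, and the same dictionary could also be built with the parallel helper above if the distinct codes are still numerous.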