
This question is a follow-up to this one: How to increase the performance of a Python loop?.

Basically I have a script that takes a few CSV files as input and, after some data manipulation, outputs 2 CSV files. In this script there is a loop over a table with ~14 million rows whose objective is to create another table with the same number of rows. I am working in Python on this project, but the loop is just too slow (I know this because I used the tqdm package to measure its speed).

So I'm looking for suggestions on what I should use to achieve my objective. Ideally the technology is free and doesn't take long to learn. I already got a few suggestions from other people: Cython and Power BI. The latter is paid and the former seems complicated, but I am willing to learn it if it is indeed useful.

If more details are necessary just ask. Thanks.

Wasonic
  • 39
  • 6

3 Answers

3

Read about Vaex. Vaex can help you process your data much, much faster. You should first convert your CSV file to HDF5 format using the vaex library; CSV files are very slow to read and write.

Vaex will do the multiprocessing for your operations.

Also check whether you can vectorize your computation (you probably can). I glanced at your code: try to avoid using Python lists; use NumPy arrays instead where you can.
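To illustrate the vectorization point: the sketch below uses a hypothetical per-row computation (your actual logic will differ), comparing a plain Python loop against the same arithmetic expressed on whole NumPy arrays.

```python
import numpy as np

def slow_loop(a, b):
    # Row-by-row Python loop: ~14M iterations is painfully slow this way.
    out = []
    for i in range(len(a)):
        out.append(a[i] * 2 + b[i])
    return out

def vectorized(a, b):
    # Same arithmetic expressed on whole arrays at once; NumPy runs the
    # loop in compiled code instead of the Python interpreter.
    return a * 2 + b

a = np.arange(5)
b = np.ones(5)
print(vectorized(a, b).tolist())  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

Both functions produce the same values; on millions of rows the vectorized version is typically orders of magnitude faster.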

RookieScientist
  • 314
  • 2
  • 12
2

If you're willing to stay with Python, I would recommend the multiprocessing module. Corey Schafer has a good tutorial on how it works here.

Multiprocessing is a bit like threading, but it uses multiple interpreter processes that run truly in parallel, whereas the threading module switches between threads within a single interpreter, so CPU-bound code gains little from it.

Divide up the work with however many cores you have on your CPU with this:

import os
cores = os.cpu_count()  # number of logical CPU cores available

This should speed up the workload by dividing the work across all your CPU cores.
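A minimal sketch of that idea, assuming a hypothetical per-row `transform` function (replace it with the real computation):

```python
import os
from multiprocessing import Pool

def transform(value):
    # Hypothetical per-row computation; substitute your real logic.
    return value * 2

def process_all(rows):
    # Spread the rows across every available CPU core.
    with Pool(processes=os.cpu_count()) as pool:
        # chunksize batches the work so workers aren't fed one item at a time,
        # which matters a lot at ~14M rows.
        return pool.map(transform, rows, chunksize=1000)

if __name__ == "__main__":
    print(process_all(range(10))[:3])  # [0, 2, 4]
```

Note that the worker function must be defined at module level so it can be pickled and sent to the worker processes, and there is per-row serialization overhead, so this pays off most when each row's computation is non-trivial.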

2

14 million rows is very achievable in Python, but not with inefficient looping methods. I had a glance at the code you posted here and saw that you're using iterrows(). iterrows() is fine for small DataFrames, but it is (as you know) painfully slow on DataFrames the size of yours. Instead, I suggest you start by looking into the apply() method (see the docs here). That should get you up to speed!
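A small sketch of the switch, using a made-up column and computation since the original code isn't shown here:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# Slow pattern: iterrows() materialises a Series object for every row.
doubled_slow = [row["x"] * 2 for _, row in df.iterrows()]

# apply() pushes the per-element call into pandas and avoids that overhead.
doubled = df["x"].apply(lambda v: v * 2)

print(doubled.tolist())  # [2, 4, 6, 8]
```

Both produce the same result; apply() is faster, and a fully vectorized expression (here simply `df["x"] * 2`) is faster still when the operation allows it.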

Hassan A
  • 337
  • 2
  • 10
  • I'm trying your suggestion but I have a problem. I do df.apply(..., axis = 0), where df is a DataFrame with 2 columns. With axis=0 the apply method applies the given function to one column at a time. However, I need to access the values from both columns of each row to do what I want. Is there a way to do this? – Wasonic Apr 12 '21 at 07:00
  • Should be possible! You should be able to find examples in this thread: https://stackoverflow.com/questions/19914937/applying-function-with-multiple-arguments-to-create-a-new-pandas-column – Hassan A Apr 13 '21 at 05:21
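The pattern the linked thread describes boils down to passing axis=1, which hands each row to the function so both columns are reachable by name. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=1 passes each row (a Series) to the function, so values from
# both columns of that row can be combined.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(row_sum.tolist())  # [11, 22, 33]

# If the operation is plain arithmetic, a fully vectorized expression
# avoids the per-row function call entirely and is faster still.
also_sum = df["a"] + df["b"]
print(also_sum.tolist())  # [11, 22, 33]
```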