
This question is a follow-up to this one: How to increase the performance of a Python loop?.

Basically I have a script that takes a few CSV files as input and, after some data manipulation, outputs 2 CSV files. In this script there is a loop over a table with ~14 million rows whose objective is to create another table with the same number of rows. I am working in Python on this project, but the loop is just too slow (I know this because I used the tqdm package to measure its speed).

So I'm looking for suggestions on what I should use to achieve my objective. Ideally the technology is free and doesn't take long to learn. I already got a few suggestions from other people: Cython and Power BI. The latter is paid and the former seems complicated, but I am willing to learn it if it is indeed useful.

If more details are necessary just ask. Thanks.

Wasonic
  • 39
  • 6

3 Answers

3

Read about Vaex. Vaex can help you process your data much, much faster. You should first convert your CSV file to HDF5 format using the vaex library; CSV files are very slow to read and write.

Vaex will do the multiprocessing for your operations.

Also check whether you can vectorize your computation (you probably can). I glanced at your code: try to avoid using Python lists; use NumPy arrays instead where you can.
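To illustrate the vectorization point: the sketch below uses a hypothetical per-row computation (your actual logic will differ), comparing a plain Python loop against the same arithmetic expressed on whole NumPy arrays.

```python
import numpy as np

def slow_loop(a, b):
    # Row-by-row Python loop: ~14M iterations is painfully slow this way.
    out = []
    for i in range(len(a)):
        out.append(a[i] * 2 + b[i])
    return out

def vectorized(a, b):
    # Same arithmetic expressed on whole arrays at once; NumPy runs the
    # loop in compiled code instead of the Python interpreter.
    return a * 2 + b

a = np.arange(5)
b = np.ones(5)
print(vectorized(a, b).tolist())  # [1.0, 3.0, 5.0, 7.0, 9.0]
```

Both functions produce the same values; on millions of rows the vectorized version is typically orders of magnitude faster.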

RookieScientist
  • 314
  • 2
  • 12
2

If you're willing to stay with Python, I would recommend the multiprocessing module. Corey Schafer has a good tutorial on how it works here.

Multiprocessing is a bit like threading, but it uses multiple interpreter processes that run truly in parallel, whereas the threading module switches between threads within a single interpreter, so CPU-bound code gains little from it.

Divide up the work with however many cores you have on your CPU with this:

import os
cores = os.cpu_count()  # number of logical CPU cores available

This should speed up the workload by dividing the work across all your CPU cores.
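A minimal sketch of that idea, assuming a hypothetical per-row `transform` function (replace it with the real computation):

```python
import os
from multiprocessing import Pool

def transform(value):
    # Hypothetical per-row computation; substitute your real logic.
    return value * 2

def process_all(rows):
    # Spread the rows across every available CPU core.
    with Pool(processes=os.cpu_count()) as pool:
        # chunksize batches the work so workers aren't fed one item at a time,
        # which matters a lot at ~14M rows.
        return pool.map(transform, rows, chunksize=1000)

if __name__ == "__main__":
    print(process_all(range(10))[:3])  # [0, 2, 4]
```

Note that the worker function must be defined at module level so it can be pickled and sent to the worker processes, and there is per-row serialization overhead, so this pays off most when each row's computation is non-trivial.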

2

14 million rows is very achievable in Python, but not with inefficient looping methods. I had a glance at the code you posted here and saw that you're using iterrows(). iterrows() is fine for small DataFrames, but it is (as you know) painfully slow on DataFrames the size of yours. Instead, I suggest you start by looking into the apply() method (see the docs here). That should get you up to speed!
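A small sketch of the switch, using a made-up column and computation since the original code isn't shown here:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4]})

# Slow pattern: iterrows() materialises a Series object for every row.
doubled_slow = [row["x"] * 2 for _, row in df.iterrows()]

# apply() pushes the per-element call into pandas and avoids that overhead.
doubled = df["x"].apply(lambda v: v * 2)

print(doubled.tolist())  # [2, 4, 6, 8]
```

Both produce the same result; apply() is faster, and a fully vectorized expression (here simply `df["x"] * 2`) is faster still when the operation allows it.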

Hassan A
  • 337
  • 2
  • 10
  • I'm trying your suggestion but I have a problem. I do df.apply(..., axis = 0), where df is a DataFrame with 2 columns. With axis=0 the apply method applies the given function to one column at a time. However, I need to access the values from both columns of each row to do what I want. Is there a way to do this? – Wasonic Apr 12 '21 at 07:00
  • Should be possible! You should be able to find examples in this thread: https://stackoverflow.com/questions/19914937/applying-function-with-multiple-arguments-to-create-a-new-pandas-column – Hassan A Apr 13 '21 at 05:21
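The pattern the linked thread describes boils down to passing axis=1, which hands each row to the function so both columns are reachable by name. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=1 passes each row (a Series) to the function, so values from
# both columns of that row can be combined.
row_sum = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(row_sum.tolist())  # [11, 22, 33]

# If the operation is plain arithmetic, a fully vectorized expression
# avoids the per-row function call entirely and is faster still.
also_sum = df["a"] + df["b"]
print(also_sum.tolist())  # [11, 22, 33]
```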