
I'm using modin.pandas to scale pandas for large datasets. However, when I use pd.read_csv in a Jupyter notebook to load a 5 MB CSV file and compare the performance of modin.pandas against plain pandas, the execution times are not what I expected.

modin.pandas is taking more time than pandas. Why?

Code-

import modin.pandas as mpd
df = mpd.read_csv(r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")

import pandas as pd
df = pd.read_csv(r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")
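(In a notebook the timings would typically come from the %time magic; outside Jupyter, a small helper can do the same. The `time_call` function below is illustrative, not from the original post.)

```python
import time

def time_call(fn, *args, **kwargs):
    """Call fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage with the paths from the question:
# df, elapsed = time_call(pd.read_csv, r"C:\Downloads\annual-enterprise-survey-2019-financial-year-provisional-csv.csv")
# print("pandas read_csv: {:.2f}s".format(elapsed))
```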

Here's the link to the CSV file. I'm using modin version 0.8.3 and pandas version 1.1.5.

Output screenshot-

Jupyter notebook output

System information-

System information

Edit: I tried using a 500 MB CSV file and the result only improved slightly: the execution times for modin and pandas are almost the same now. Is this usual?

Shradha
  • Try a bigger CSV file. 5MB might not be enough to make the speed increase gained by parallelizing the workload be greater than the overhead of setting up the parallelization. – Jack Taylor Jan 18 '21 at 15:10
  • [GitHub documentation](https://github.com/modin-project/modin#modin-is-a-dataframe-designed-for-datasets-from-1mb-to-1tb) says that it works even for 1 MB dataset. I'd expect the same execution time for pandas and modin for smaller datasets. Am I doing something wrong? – Shradha Jan 18 '21 at 20:01
  • "Works" is not the same thing as "is faster than". To make something run across several processors, there needs to be code that splits the work into chunks, sends it to each processor, and pieces the results back together when the processors are finished with each chunk. This does not come for free. The code has to run somewhere, so CPU time will necessarily be split between the parallelization overhead and the actual work. At some point there will be enough work to justify the overhead, but before that the single-threaded solution will be faster. – Jack Taylor Jan 19 '21 at 00:21
  • I don't think you are doing anything wrong; I just think that 5MB is probably not enough data to justify using modin. But that's just a guess. To prove it, try using a 500MB CSV file instead of a 5MB CSV file. (Just concatenating your existing data 100 times should do fine.) If that is slower in modin than in regular pandas, then you will know that you have a problem. – Jack Taylor Jan 19 '21 at 00:23
  • Please see my updated question. The performance has just slightly improved. – Shradha Jan 26 '21 at 20:39
  • What times did you get for the 500MB file? I'll see if I can reproduce your results later if I have time. – Jack Taylor Jan 27 '21 at 00:31
  • I have this exact issue with no fix as of yet, will just revert back to regular pandas. I have a 4 core i7 processor, a 4GB csv file, and the latest modin version installed from pipy. – PJ_ Oct 21 '22 at 20:29

1 Answer


It appears that Modin does some initialisation the first time it runs, which would explain why your Modin time was slower than your Pandas time for the 5MB CSV file.

I investigated how long it took to load various sizes of CSV file on a system with four cores, with both Pandas and Modin. Here is the graph of the results for CSV files from 5MB to 100MB:

Graph of Pandas/Modin CSV file processing times for files from 5MB to 100MB

And for files up to 2GB:

Graph of Pandas/Modin CSV file processing times for files up to 2GB

The results show that, on the system tested:

  • Modin reads CSV files faster than Pandas for all file sizes except 5MB (the smallest file tested)
  • The performance gap widens as the file size grows
  • Modin took a disproportionately long time to process the 5MB file: 0.23s, compared to 0.12s for the 10MB file and 0.16s for the 15MB file

This is the code used to generate the results.

from pathlib import Path
from timeit import timeit

import modin.pandas as mpd
import pandas as pd

def create_input_file(filename, content, repetitions):
    # Write `content` to `filename` `repetitions` times, skipping files
    # that already exist from a previous run.
    path = Path(filename)
    if not path.exists():
        with path.open("a", encoding="utf-8") as f:
            for _ in range(repetitions):
                f.write(content)

def create_input_files(min_size, max_size, increment):
    # Build the test files by repeating the original 5MB survey CSV.
    content = Path("survey.csv").read_text(encoding="utf-8")
    for size in range(min_size, max_size + 1, increment):
        create_input_file(
            filename="survey{}MB.csv".format(size),
            content=content,
            repetitions=size // 5,
        )

def time_csv_read(module, filename, description):
    # Time a single read_csv call from the given module (pandas or Modin).
    print(
        "{}: {:.2f} seconds".format(
            description,
            timeit(lambda: module.read_csv(filename), number=1)
        )
    )

def time_csv_reads(min_size, max_size, increment):
    for size in range(min_size, max_size + 1, increment):
        time_csv_read(pd, "survey{}MB.csv".format(size), "Pandas {}MB".format(size))
        time_csv_read(mpd, "survey{}MB.csv".format(size), "Modin {}MB".format(size))

def main():
    min_size1 = 5
    max_size1 = 95
    increment1 = 5
    min_size2 = 100
    max_size2 = 2000
    increment2 = 100
    create_input_files(min_size1, max_size1, increment1)
    create_input_files(min_size2, max_size2, increment2)
    time_csv_reads(min_size1, max_size1, increment1)
    time_csv_reads(min_size2, max_size2, increment2)

if __name__ == "__main__":
    main()

And here is the raw output (with warning messages removed):

Pandas 5MB: 0.12 seconds
Modin 5MB: 0.23 seconds
Pandas 10MB: 0.13 seconds
Modin 10MB: 0.12 seconds
Pandas 15MB: 0.19 seconds
Modin 15MB: 0.16 seconds
Pandas 20MB: 0.24 seconds
Modin 20MB: 0.20 seconds
Pandas 25MB: 0.31 seconds
Modin 25MB: 0.25 seconds
Pandas 30MB: 0.37 seconds
Modin 30MB: 0.29 seconds
Pandas 35MB: 0.40 seconds
Modin 35MB: 0.34 seconds
Pandas 40MB: 0.45 seconds
Modin 40MB: 0.37 seconds
Pandas 45MB: 0.51 seconds
Modin 45MB: 0.42 seconds
Pandas 50MB: 0.55 seconds
Modin 50MB: 0.46 seconds
Pandas 55MB: 0.62 seconds
Modin 55MB: 0.50 seconds
Pandas 60MB: 0.67 seconds
Modin 60MB: 0.53 seconds
Pandas 65MB: 0.74 seconds
Modin 65MB: 0.57 seconds
Pandas 70MB: 0.76 seconds
Modin 70MB: 0.61 seconds
Pandas 75MB: 0.87 seconds
Modin 75MB: 0.65 seconds
Pandas 80MB: 0.90 seconds
Modin 80MB: 0.67 seconds
Pandas 85MB: 0.93 seconds
Modin 85MB: 0.73 seconds
Pandas 90MB: 0.97 seconds
Modin 90MB: 0.74 seconds
Pandas 95MB: 1.34 seconds
Modin 95MB: 0.80 seconds
Pandas 100MB: 1.11 seconds
Modin 100MB: 0.83 seconds
Pandas 200MB: 2.21 seconds
Modin 200MB: 1.62 seconds
Pandas 300MB: 3.28 seconds
Modin 300MB: 2.40 seconds
Pandas 400MB: 5.48 seconds
Modin 400MB: 3.25 seconds
Pandas 500MB: 8.61 seconds
Modin 500MB: 3.92 seconds
Pandas 600MB: 8.11 seconds
Modin 600MB: 4.64 seconds
Pandas 700MB: 9.48 seconds
Modin 700MB: 5.70 seconds
Pandas 800MB: 11.40 seconds
Modin 800MB: 6.35 seconds
Pandas 900MB: 12.63 seconds
Modin 900MB: 7.17 seconds
Pandas 1000MB: 13.59 seconds
Modin 1000MB: 7.91 seconds
Pandas 1100MB: 14.84 seconds
Modin 1100MB: 8.63 seconds
Pandas 1200MB: 17.27 seconds
Modin 1200MB: 9.42 seconds
Pandas 1300MB: 17.77 seconds
Modin 1300MB: 10.22 seconds
Pandas 1400MB: 19.38 seconds
Modin 1400MB: 11.15 seconds
Pandas 1500MB: 21.77 seconds
Modin 1500MB: 11.98 seconds
Pandas 1600MB: 26.79 seconds
Modin 1600MB: 12.55 seconds
Pandas 1700MB: 23.55 seconds
Modin 1700MB: 13.66 seconds
Pandas 1800MB: 26.41 seconds
Modin 1800MB: 13.89 seconds
Pandas 1900MB: 28.44 seconds
Modin 1900MB: 15.15 seconds
Pandas 2000MB: 30.58 seconds
Modin 2000MB: 15.71 seconds

The fact that Modin processed the 10MB file faster than the 5MB file suggested to me that Modin does some initialisation work the first time it runs, so I tested this theory by reading the same 5MB file multiple times. The first time took 0.28 seconds, and subsequent times all took 0.08 seconds. You should see a similar difference in performance if you run Modin multiple times in the same Python process.

This initialisation work is different from the type of overhead I was talking about in my comment on your question. I was thinking of code that splits the work into chunks, sends it to each processor, and pieces the results back together when the processors are finished with each chunk. This kind of overhead will occur every time Modin reads a CSV file; the extra work that Modin does the first time it runs must be something else. So once Modin has done its initialisation, it will be worth using it even for files as small as 5MB. For files smaller than that, the kind of overhead I was talking about will likely become a factor, but it will take more investigation to know how small the files need to be for it to make a difference.
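The cold-start effect can be illustrated without Modin itself. This toy sketch (everything in it is hypothetical, standing in for a library that launches worker processes lazily) shows why only the first timed call pays the startup cost:

```python
from timeit import timeit
import time

class LazyEngine:
    """Toy stand-in for a library that starts its workers on first use."""
    def __init__(self, startup_seconds=0.2):
        self.startup_seconds = startup_seconds
        self.started = False

    def read(self):
        if not self.started:              # one-time initialisation
            time.sleep(self.startup_seconds)
            self.started = True
        return "data"                     # the actual (cheap) work

engine = LazyEngine()
first = timeit(engine.read, number=1)    # pays the startup cost
second = timeit(engine.read, number=1)   # does not
```

This mirrors the measurements above: the first 5MB read took 0.28 seconds, and subsequent reads took 0.08 seconds, so benchmarks of small files should discard (or report separately) the first call.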

Jack Taylor