I'm just a Python newbie who's had fun dealing with data with Python.
When I was be able to use Python's representative data tool, Pandas, it seemed that it would be able to work on Excel very quickly.
However, I was somewhat disappointed to see it take more than 1 to 2 minutes to retrieve data(.xlsx) with 470,000 rows, and as a result, I found out that using modin and ray (or dask) would enable faster operation.
After learning how to use it simply as below, I compared it to using Pandas only. (this time, 100M rows data, about 5GB)
import ray
ray.init()
import modin.pandas as md
%%time
TB = md.read_csv('train.csv')
TB
But it only took 1 minute and 3 seconds to write Pandas, but it took 1 minute and 9 seconds to write modin [ray]. I was disappointed to see that it would take longer than just a small difference.
How can I use modin faster than pandas? Complex operations such as groupby or merge? Is there little difference in simply reading data?
Modin is faster to read data when other people are using it, is there something wrong with my computer's settings? I want to know why.
Write down the method installed at the prompt just in case you need it.
!pip install modin[ray]
!pip install ray[default]