
I'm trying to speed up my code using parallel processing with the modin library.

I first tried the Dask engine on my Windows 10 computer, but it didn't work; I assumed that was because it is still under development. I read that the Ray engine can't be used on Windows, so I'm running a simple example on a free AWS Ubuntu server to check how the library works.

When I try to install the modin package, after successfully installing the ray and pandas packages, I get the following error:

ERROR: Could not find a version that satisfies the requirement pandas==1.0.3 (from versions: 0.1, 0.2b0, 0.2b1, 0.2, 0.3.0b0, 0.3.0b2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0rc1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0rc1, 0.8.0rc2, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0rc1, 0.19.0, 0.19.1, 0.19.2, 0.20.0rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0rc1, 0.21.0, 0.21.1, 0.22.0, 0.23.0rc2, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0rc1, 0.24.0, 0.24.1, 0.24.2)
ERROR: No matching distribution found for pandas==1.0.3

If I run `pip3 install -vvv modin` in the terminal to get the logs, I get:

Exception information:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/cli/base_command.py", line 188, in _main
    status = self.run(options, args)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/cli/req_command.py", line 185, in wrapper
    return func(self, options, args)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/commands/install.py", line 333, in run
    reqs, check_supported_wheels=not options.target_dir
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 179, in resolve
    discovered_reqs.extend(self._resolve_one(requirement_set, req))
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 362, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 313, in _get_abstract_dist_for
    self._populate_link(req)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 279, in _populate_link
    req.link = self.finder.find_requirement(req, upgrade)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/index/package_finder.py", line 930, in find_requirement
    req)
pip._internal.exceptions.DistributionNotFound: No matching distribution found for pandas==1.0.3 (from modin)
Removed build tracker: '/tmp/pip-req-tracker-oklngevc'

How can I solve this problem?

The script I want to run to check how it works is:

import os
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
import modin.pandas as pd
import time
import pandas as pn

start_time = time.time()
datos = pd.read_csv('datospruebaAWS.csv', header=None, index_col=0)
end_time = time.time()
print("time read csv parallel=", end_time - start_time)

start_time = time.time()
datos = pn.read_csv('datospruebaAWS.csv', header=None, index_col=0)
end_time = time.time()
print("time read csv=", end_time - start_time)

and one of the scripts that I want to speed up, just by replacing `import pandas as pd` with `import modin.pandas as pd`, is:

import pandas as pd
import glob
import time

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

start_time = time.time()
cookies = []
for file in all_filenames:
    datos = pd.read_csv(file, header=None, index_col=0)
    datos.index.name = 'CookieID'
    print('leido')
    for i in range(len(datos)):
        if datos[2].iloc[i].find('golf') != -1:
            cookies.append(datos.index[i])
    print('cookies')
    print(len(cookies))
    del datos

end_time = time.time()
print("time=", end_time - start_time)

cookies = pd.Series(cookies)
cookies = cookies.unique()
cookies = pd.DataFrame(cookies)
cookies['Owner ID'] = ['Les gusta el golf']*len(cookies)
cookies.to_csv('DMP_golf.txt', header=False, index=False, sep='\t')

I want to speed it up because the folder contains many large CSV files and the script takes hours to finish.

Also, are there other ways to speed up this code?

halfer
Geno
1 Answer


Looks like Pandas 1.0.3 does not support Python 3.5, which you are using. See the "version" column in https://pypi.org/project/pandas/1.0.3/#files.
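To confirm this on the server, a quick check of the interpreter that pip is actually running under can help. The sketch below assumes that pandas 1.0.3 requires Python 3.6.1 or newer (that is my reading of its PyPI metadata, so verify it against the linked page); the function name `supports_pandas_1_0_3` is made up for illustration.

```python
import sys

# Assumption: pandas 1.0.3 declares python_requires ">=3.6.1" on PyPI.
MIN_PYTHON = (3, 6, 1)


def supports_pandas_1_0_3(version_info=sys.version_info):
    """Return True if the given interpreter version meets the assumed minimum."""
    return tuple(version_info[:3]) >= MIN_PYTHON


# Report the interpreter this script runs under and whether it qualifies.
current = ".".join(str(part) for part in sys.version_info[:3])
print("Python", current, "meets the minimum:", supports_pandas_1_0_3())
```

Run it with the same interpreter that pip uses (for example `python3 check.py` if you install with `pip3`). If the check fails but a newer Python is installed, invoking pip through that interpreter, such as `python3.7 -m pip install modin`, avoids picking up the pip that belongs to Python 3.5.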

Robert Nishihara
Hello Robert, thanks for replying. I had two more problems besides that. The first is that, even though I had installed Python 3.7.8 on my server, when I ran `sudo apt-get install python3-pip` the resulting pip ran under Python 3.5.2, although I had set version 3.7 as the default. I solved this with [loved.by.Jesus's reply](https://stackoverflow.com/questions/54509031/pip-for-python3-7-ubuntu-16-04). The other is that I was working on an Ubuntu server with 1 GiB of memory and 1 CPU, which is why multiprocessing didn't work. I then tried a t3.xlarge AWS server and it worked. Thanks :) – Geno Jul 22 '20 at 06:35
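On the second part of the question (other ways to speed up the script), one option that does not depend on Modin is to replace the row-by-row `.iloc[i].find('golf')` loop with a single vectorized string match per file. This is only a sketch under assumptions: the helper name `matching_ids` and the small in-memory frame are invented for illustration, and the frame stands in for one CSV read with `header=None, index_col=0`.

```python
import pandas as pd


def matching_ids(frame, column=2, needle="golf"):
    """Return the index labels of rows whose `column` contains `needle`."""
    # regex=False gives a plain substring test, like str.find() != -1;
    # na=False treats missing values as non-matching.
    mask = frame[column].str.contains(needle, regex=False, na=False)
    return frame.index[mask.to_numpy()]


# Invented stand-in for one file read with header=None, index_col=0.
demo = pd.DataFrame(
    {1: ["a", "b", "c"], 2: ["golf club", "tennis", "minigolf"]},
    index=["id1", "id2", "id3"],
)

# A set keeps the IDs unique as they are collected.
found = set(matching_ids(demo))
print(sorted(found))
```

In the original loop this would amount to calling `cookies.update(matching_ids(datos))` once per file, with `cookies` created as a `set()`, which also makes the later `unique()` step unnecessary.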