
Age-old problem, it seems. I have googled around for a solution, but there doesn't seem to be a straightforward one. What I would like is a way of installing Python and dependencies (such as pandas, numpy, and packages that depend on them and are not in the default Anaconda installation) on all nodes of a Hadoop cluster.

What I have found so far:

Easiest way to install Python dependencies on Spark executor nodes?

Shipping Python modules in pyspark to other nodes

Using an egg certainly doesn't work in this scenario, and installing manually on each node is exactly what I want to avoid, because at some point you will also want to update everything, and repeating that every 3 months or so just doesn't seem efficient.

Since these posts were made, have there been any new developments (tools) regarding this issue? Any other options?

EDIT December 19th 2018:

This was for a Big Data course, and we ended up using parallel-ssh.

With it you can write your own CLI install script. In our case we downloaded and installed Anaconda on every node and then installed the needed packages (see the sketch below). That worked fine; however, the Spark configuration (if Spark is already installed) must be adjusted so that it uses this new version of Python. Of course, this could also be done by editing or replacing the configuration files.
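For reference, a minimal sketch of what the install step looked like, using the Python client of the parallel-ssh library (2.x ParallelSSHClient API). The host names, SSH user, installer URL, and install prefix below are placeholders, not our actual values:

    # Minimal sketch with the parallel-ssh Python client (pssh, 2.x API).
    # Host names, SSH user, installer URL and install prefix are placeholders.
    from pssh.clients import ParallelSSHClient

    hosts = ["node01", "node02", "node03"]            # all cluster nodes
    client = ParallelSSHClient(hosts, user="hadoop")  # assumes key-based SSH access

    install_cmd = (
        "wget -q https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh"
        " -O /tmp/anaconda.sh"
        " && bash /tmp/anaconda.sh -b -p /opt/anaconda3"        # -b = silent install
        " && /opt/anaconda3/bin/conda install -y pandas numpy"  # extra packages
    )

    # Run the same command on every node in parallel and wait for completion
    output = client.run_command(install_cmd)
    client.join(output)

    # Show each node's exit code so failures are visible
    for host_out in output:
        print(host_out.host, host_out.exit_code)

After that, Spark has to be pointed at the new interpreter, for example by exporting PYSPARK_PYTHON=/opt/anaconda3/bin/python in conf/spark-env.sh on each node (or by setting spark.pyspark.python in spark-defaults.conf).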

All in all, there are a lot of deep rabbit holes, and there is probably no way around relying on DevOps or, if that is not possible, learning Ansible (which we wanted to avoid, as it is more or less yet another language and tool to learn).

