
Currently, I'm running a Spark cluster in standalone mode using Docker, on a physical machine with 16GB of RAM running Ubuntu 16.04.1 x64.

RAM configuration of the Spark cluster containers: master 4g, slave1 2g, slave2 2g, slave3 2g

docker run -itd --net spark -m 4g -p 8080:8080 --name master --hostname master MyAccount/spark &> /dev/null
docker run -itd --net spark -m 2g --name slave1 --hostname slave1 MyAccount/spark &> /dev/null
docker run -itd --net spark -m 2g --name slave2 --hostname slave2 MyAccount/spark &> /dev/null
docker run -itd --net spark -m 2g --name slave3 --hostname slave3 MyAccount/spark &> /dev/null
docker exec -it master sh -c 'service ssh start' > /dev/null
docker exec -it slave1 sh -c 'service ssh start' > /dev/null
docker exec -it slave2 sh -c 'service ssh start' > /dev/null
docker exec -it slave3 sh -c 'service ssh start' > /dev/null
docker exec -it master sh -c '/usr/local/spark/sbin/start-all.sh' > /dev/null

There is about 170GB of data in my MongoDB database. I ran MongoDB with ./mongod, without any replication or sharding, on the local host (not in Docker).

I'm using the Stratio spark-mongodb connector.

I ran the following command on the "master" container:

/usr/local/spark/bin/spark-submit --master spark://master:7077 --executor-memory 2g --executor-cores 1 --packages com.stratio.datasource:spark-mongodb_2.11:0.12.0 code.py
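One thing worth checking (my assumption, not something confirmed by the logs): a JVM executor started with --executor-memory 2g needs more than 2 GB of real memory once off-heap overhead is included, which exceeds the 2g Docker limit on each slave container. A rough sketch using Spark's default overhead rule, max(384 MB, 10% of executor memory):

```python
# Rough estimate of the real memory footprint of one Spark executor.
# Spark's default off-heap overhead is max(384 MB, 10% of executor memory).
def executor_footprint_mb(executor_memory_mb):
    overhead_mb = max(384, int(0.10 * executor_memory_mb))
    return executor_memory_mb + overhead_mb

container_limit_mb = 2048                 # docker run -m 2g
footprint = executor_footprint_mb(2048)   # --executor-memory 2g
print(footprint, footprint > container_limit_mb)  # 2432 True
```

In standalone mode the overhead is not reserved up front the way it is on YARN, but the JVM still uses extra memory beyond the heap, so an executor heap equal to the container limit leaves no headroom for threads, sockets, or off-heap buffers.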

code.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the MongoDB collection as a temporary view via the Stratio connector
spark.sql("CREATE TEMPORARY VIEW tmp_tb USING com.stratio.datasource.mongodb OPTIONS (host 'MyPublicIP:27017', database 'firewall', collection 'log_data')")

df = spark.sql("SELECT * FROM tmp_tb")
df.show()

I also modified the ulimit values in /etc/security/limits.conf and /etc/security/limits.d/20-nproc.conf:

* soft nofile 131072
* hard nofile 131072
* soft nproc unlimited
* hard nproc unlimited
* soft fsize unlimited
* hard fsize unlimited
* soft memlock unlimited
* hard memlock unlimited
* soft cpu unlimited
* hard cpu unlimited
* soft as unlimited
* hard as unlimited

root soft nofile 131072
root hard nofile 131072
root soft nproc unlimited
root hard nproc unlimited
root soft fsize unlimited
root hard fsize unlimited
root soft memlock unlimited
root hard memlock unlimited
root soft cpu unlimited
root hard cpu unlimited
root soft as unlimited
root hard as unlimited
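Note that processes inside a Docker container do not necessarily inherit the host's limits.conf (container limits come from the Docker daemon unless overridden with --ulimit). To confirm what a process inside a container actually sees, a small check with Python's standard resource module can be run there; this is just a sketch of how I would verify it:

```python
import resource

# Print the soft/hard limits the current process actually sees.
# RLIM_INFINITY corresponds to "unlimited" in limits.conf.
for name, limit in [("nofile", resource.RLIMIT_NOFILE),
                    ("nproc",  resource.RLIMIT_NPROC),
                    ("stack",  resource.RLIMIT_STACK)]:
    soft, hard = resource.getrlimit(limit)
    fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else v
    print(name, fmt(soft), fmt(hard))
```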

$ ulimit -a

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 63682
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 131072
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I also added

kernel.pid_max=200000
vm.max_map_count=600000

to /etc/sysctl.conf.

Then I rebooted and ran the Spark program again.

I still get the following errors: "pthread_create failed: Resource temporarily unavailable" and "com.mongodb.MongoException$Network: Exception opening the socket".

Error screenshots: "pyspark error", "mongodb error" (images not reproduced here)

Is the physical memory not enough, or which part of the configuration did I get wrong?

Thanks.
