I have a bash shell script that runs about 70 instances of a Python application. Each Python instance runs TensorFlow 2.0, wakes up once per hour, and does some work. The script runs fine from the user shell, but when it runs from cron it core dumps after the 36th instance is started.
I have set up the shell script to use fully qualified paths and have verified that the environments are identical in both cases.
This runs on a 36 core machine running Ubuntu on AWS: #56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
It appears that there is some limit to the number of "Tasks" that cron can run.
Is there a setting to change the number of Tasks allowed in cron?
Here is the crontab entry:
*/5 * * * * /myscripts/watchdog.sh >> /myscripts/watchdog.log 2>&1
This runs every 5 minutes and checks for the running processes. If they're not running, it starts them.
#!/bin/bash
# https://serverfault.com/questions/710847/how-to-apply-memory-limits-to-all-cron-jobs
# checking the cron ulimit
# systemctl status cron
# more /etc/pam.d/cron
# talking about /etc/security/limits.conf
export PATH=/runner/venv/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
/bin/echo "##################### watchdog.sh running now #####################"
/bin/date
export LANG=C.UTF-8
export USER=ubuntu
export HOME=/home/ubuntu
export MAIL=/var/mail/ubuntu
export SHELL=/bin/bash
export LOGNAME=ubuntu
# https://unix.stackexchange.com/questions/162104/how-to-change-the-kernel-max-pid-number
# pid_max is 4194304 for 64 bit
if grep -q 56000 /proc/sys/kernel/pid_max; then
    /bin/echo "/proc/sys/kernel/pid_max = 56000"
else
    /bin/echo 56000 | sudo tee /proc/sys/kernel/pid_max
fi
# https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt
if grep -q 48000 /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max; then
    /bin/echo "/sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max = 48000"
else
    /bin/echo 48000 | /usr/bin/sudo tee /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max
fi
export DEPLOY_ENV="system_one"
export VIRTUAL_ENV="/runner/venv"
hash -r
# see https://stackoverflow.com/questions/51256738/multiple-instances-of-python-running-simultaneously-limited-to-35
#export OPENBLAS_NUM_THREADS=1
#export OMP_NUM_THREADS=1
export AEP="/runner/analyzerengine"
export PID_FILE_DIR="/runner/pids"
export OUT_FILE_DIR="/runner/out"
while read producer; do
    producer="$(/bin/echo $producer| /bin/sed 's/\r//g')"
    export PIDFILE="${PID_FILE_DIR}/${producer}.pid"
    /bin/echo "Checking producer=$producer in file $PIDFILE"
    if [ -e "${PIDFILE}" ] && [ "$(/bin/ps -o pid= -p "$(/bin/sed 's/ //g' < "${PIDFILE}")")" ] ; then
        /bin/echo "${producer} process PID check OK (running) on $(/bin/date) ."
    else
        /bin/echo "Restarting ${producer} process on $(/bin/date)..."
        /bin/echo "executing: ${VIRTUAL_ENV}/bin/python ${AEP}/runnerCode.py --producer=${producer} --deployment=${DEPLOY_ENV} &> ${OUT_FILE_DIR}/${producer}.log &"
        ${VIRTUAL_ENV}/bin/python ${AEP}/runnerCode.py --producer=${producer} --deployment=${DEPLOY_ENV} > ${OUT_FILE_DIR}/${producer}.log &
        /bin/echo $! > "${PIDFILE}"
        /bin/chmod 644 ${OUT_FILE_DIR}/${producer}.log
        /bin/chmod 644 "${PIDFILE}"
        /bin/echo "...done."
    fi
done < ${AEP}/producer_list.txt
Running the command: $ systemctl status cron
Produces the following output:
cron.service - Regular background program processing daemon
Loaded: loaded (/lib/systemd/system/cron.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2019-11-24 16:59:41 UTC; 2 days ago
Docs: man:cron(8)
Main PID: 1191 (cron)
Tasks: 5391 (limit: 5529)
CGroup: /system.slice/cron.service
├─ 1191 /usr/sbin/cron -f
├─40750 /runner/venv/bin/python /runner/analyzerengine/runnerCode.py --producter=customer_A --deployment=system_one
├─40791 /runner/venv/bin/python -c from multiprocessing.semaphore_tracker import main;main(3)
...
Only 36 processes will start when this script runs from cron. When I run it as a user (username=ubuntu), all 70 processes start without a problem. Apparently some limit somewhere is not set correctly.
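To put a number on how close the cron service is to its limit, the Tasks line from the status output above can be parsed. This is only a sketch: the function name tasks_headroom is made up for illustration, and the input string is copied from the output shown above rather than read from a live system.

```shell
#!/bin/bash
# Hypothetical helper (not part of watchdog.sh): given a
# "Tasks: N (limit: M)" line as printed by `systemctl status`,
# print how many more tasks fit under the limit.
tasks_headroom() {
    local line="$1" current limit
    local pattern='s/.*Tasks: \([0-9][0-9]*\) (limit: \([0-9][0-9]*\)).*'
    # Extract the current task count and the limit with the same pattern.
    current=$(printf '%s\n' "$line" | sed -n "${pattern}/\1/p")
    limit=$(printf '%s\n' "$line" | sed -n "${pattern}/\2/p")
    if [ -z "$current" ] || [ -z "$limit" ]; then
        echo "could not parse: $line" >&2
        return 1
    fi
    echo $(( limit - current ))
}

# Line copied from the status output above.
tasks_headroom "Tasks: 5391 (limit: 5529)"
```

With the values above this prints 138, which is fewer than the couple hundred threads each runnerCode.py instance creates, so it would be consistent with the next instance failing to start. On a live system the same two numbers should be available from `systemctl show cron --property=TasksCurrent,TasksMax`.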
Since each instance of runnerCode.py creates a couple hundred threads (something built into TensorFlow that I can't control), I needed to set /proc/sys/kernel/pid_max to 56000 and /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max to 48000.
Is there some setting in systemctl that needs to be changed to enable more processes to run?
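The two limits above apply system-wide and to the user slice, while the status output lists the processes under /system.slice/cron.service. A quick way to compare the values side by side might look like the sketch below. The helper name show_limit is hypothetical, and the third path is an assumption about where the cron service's pids controller lives under cgroup v1; it may not exist on every system.

```shell
#!/bin/bash
# Hypothetical helper: print "<path> = <value>" for a limit file,
# or say so when the file is missing or unreadable.
show_limit() {
    local path="$1"
    if [ -r "$path" ]; then
        echo "$path = $(cat "$path")"
    else
        echo "$path = (not readable)"
    fi
}

# The two limits raised in watchdog.sh.
show_limit /proc/sys/kernel/pid_max
show_limit /sys/fs/cgroup/pids/user.slice/user-1000.slice/pids.max
# Assumption: the cron service has its own pids.max in this location.
show_limit /sys/fs/cgroup/pids/system.slice/cron.service/pids.max
```

If the last value turns out to be much lower than the 48000 set for the user slice, that would point at a per-service limit rather than the user limit.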
Thanks in advance!