
I want to do some parallel computing for the first time and I don't know exactly where to start.

The problem is that I have a huge file list (around 7000 csv files) that I want to process into a single output file. For this task I would like to use the campus cluster, which runs Torque PBS.

The closest question to what I want to achieve that I've found on SO so far is this one, with the main difference that I should use Torque (do I really?).

So, to keep it short, my question is: how could I implement the solution from the cited question using Torque PBS?

  • Your campus cluster should come with some documentation to get you started. Try to run simple jobs on the cluster first. Regarding your question, can you process each file independently and concatenate the results into one single file? – Dima Chubarov Jul 18 '18 at 14:16
  • Are you asking for help writing parallel code or submitting a job to Torque? Torque is just a way to submit a job to run on one or more nodes in a managed cluster. – dbeer Aug 08 '18 at 18:31

1 Answer


Well, I managed to do it the following way:

Assume there's a serial Python script named process.py that handles 100 of the csv files at a time.

Then we need a file called call_pyprocess.pbs that calls process.py, with the following contents:

#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -o out.varx
#PBS -e error.varx

source activate p2.7    # only needed if you have to activate a specific python environment

python /path/to/file/process.py varx    # varx is the iteration number

Note that process.py requires an argument parser in order to use varx as an internal variable.
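For reference, the argument handling inside process.py could look like the minimal sketch below. The data directory, the `BATCH_SIZE` constant, and the `select_batch` helper are my assumptions for illustration, not the original code:

```python
import sys
import glob

BATCH_SIZE = 100  # number of csv files each job handles

def select_batch(files, varx, batch_size=BATCH_SIZE):
    """Return the slice of `files` assigned to batch number `varx`."""
    return files[varx * batch_size : (varx + 1) * batch_size]

def main(argv):
    # varx arrives as the single command-line argument, e.g. "05"
    varx = int(argv[1])
    files = sorted(glob.glob("/path/to/data/*.csv"))  # assumed data directory
    for csv_path in select_batch(files, varx):
        pass  # process each csv here and write a partial result for this batch

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

With 7000 files and 100 per batch, batches 0 through 69 cover everything; each job would then write its own partial result, to be concatenated once all jobs finish.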

Then the jobs are submitted with the following command from bash:

for i in {00..70}; do
    cp call_pyprocess.pbs temp.pbs
    perl -pi -e "s/varx/$i/" temp.pbs
    qsub temp.pbs
done