Iteratively run a software on multi-resolution files

Question

.mcool files contain matrices for multiple resolutions.

Cooler for one .mcool file:

cooler ls ./../input/A001C007.hg38.nodups.pairs.mcool

./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/2000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/20000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/100000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/250000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000000
(EagleC)

For every file in ./input/*.mcool, if cooler ls "$mcool_file" ends with 5000, 10000, or 50000 after the last /, I want to run predictSV from EagleC.

As shown on the repo, a single .mcool file

predictSV --hic-5k SKNAS-MboI-allReps-filtered.mcool::/resolutions/5000 \
          --hic-10k SKNAS-MboI-allReps-filtered.mcool::/resolutions/10000 \
          --hic-50k SKNAS-MboI-allReps-filtered.mcool::/resolutions/50000 \
          -O SK-N-AS -g hg38 --balance-type CNV --output-format full \
          --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999

However, for a list of files, I need to write a for-loop to iteratively run predictSV.

My attempt:

#!/usr/bin/env bash
for mcool_file in input/*.mcool; do

  # iterate over ids emitted from cooler ls for this file
  hic5k_num=; hic10k_num=; hic50k_num=
  while IFS= read -r id; do
    id_suffix=${id##*/}
    case $id_suffix in
      5000)  hic5k_num=$id_suffix;;
      10000) hic10k_num=$id_suffix;;
      50000) hic50k_num=$id_suffix;;
    esac
  done < <(cooler ls "$mcool_file")

  echo predictSV \
   ${hic5k_num:+  --hic-5k "$hic5k_num"} \
   ${hic10k_num:+ --hic-10k "$hic10k_num"} \
   ${hic50k_num:+ --hic-50k "$hic50k_num"} \
   -g hg38 \
   -O "${mcool_file%%.*}" \
   --balance-type CNV \
   --output-format full \
   --prob-cutoff-5k 0.8 \
   --prob-cutoff-10k 0.8 \
   --prob-cutoff-50k 0.99999
done

However my output is:

predictSV --hic-5k 5000 --hic-10k 10000 --hic-50k 50000 -g hg38 -O input/A001C007 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k 5000 --hic-10k 10000 --hic-50k 50000 -g hg38 -O input/A001C008 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999

Expected output:

predictSV --hic-5k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500 --hic-10k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000 -g hg38 -O input/A001C007 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/500 --hic-10k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/50000 -g hg38 -O input/A001C008 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999

Edit:

Furthermore, I want to change the -O part from "${mcool_file%%.*}" to the following: EagleC_output/ + substring after the / parenthesis in mcool_file file. For example, EagleC_output/A001C007.

`hic5k_num=$id_suffix` should be `hic5k_num=$id` so you save the name, not just the suffix. — Barmar, Jul 21 '23 at 20:16
@Barmar it still lacks the `::/resolutions/5000`, `::/resolutions/10000`, and `::/resolutions/50000` substrings (i.e., the output of `cooler ls`) — melolili, Jul 21 '23 at 20:25
Funny -- we've had a question, in the distant past, asked here about someone writing almost the same code you're using here (also generating `hic5k_num`/`hic10k_num`/`hic15k_num` contants from a list). — Charles Duffy, Jul 21 '23 at 21:41
...oh, that was you (https://stackoverflow.com/questions/76314150/how-to-iteratively-run-a-program-on-multi-resolution-files) — Charles Duffy, Jul 21 '23 at 21:43
Anyhow -- as should be obvious, if 5000 isn't what you want to store and then use in the command line later, don't run `hic5k_num=$id_suffix`. Maybe you want `hic5k_num=$id` instead. — Charles Duffy, Jul 21 '23 at 21:45
If you do happen to have any interest in GNU parallel (imho it does simplify your script), I updated my answer to account for your request re dynamic `-O` option. — John Collins, Jul 22 '23 at 03:33

John Collins · Accepted Answer · 2023-07-27T07:49:55.467

You could use GNU Parallel to simplify your bash script.

Modification of your bash script using `parallel`:

EDIT: Modified bash script including dynamic `-O` option based on `grep` parsing `.mcool` input filenames

#!/usr/bin/env bash
for mcool_file in input/*.mcool; do

    OUTPUT=$(echo "$mcool_file" | grep -oE "[A-Z0-9]*\.hg38" | cut -d '.' -f 1)

    cooler ls "$mcool_file" | grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' \
    | tr '\n' '\t' \
    | parallel --dry-run --colsep '\t' --link \
    predictSV --hic-5k {1} --hic-10k {2} --hic-50k {3} \
    -O 'EagleC_output/'$OUTPUT -g hg38 --balance-type CNV --output-format full \
    --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999 \
    | tr -d "'"

done

which provides:

predictSV --hic-5k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000 --hic-10k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000 -O EagleC_output/A001C007 -g hg38 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999
predictSV --hic-5k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/5000 --hic-10k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C008.hg38.nodups.pairs.mcool::/resolutions/50000 -O EagleC_output/A001C008 -g hg38 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999

assuming two files named:

A001C007.hg38.nodups.pairs.mcool
A001C008.hg38.nodups.pairs.mcool

which contain cooler ls output showing resolutions per line as you provide in your question.

(Note: simply remove the --dry-run option from the same bash script, and then run it, to actually then execute your list of commands after having first done a 'dry run' to ensure that all the commands will be executed with proper syntax. There is also a very useful --joblog option.)

Details:

To help illustrate and break down how the above bash script works:

I made a test file called 'test.mcool' containing:

./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/2000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/20000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/100000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/250000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/1000000

to mimic your output from cooler ls.

Using bash grep gives:

$ grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' test.mcool 
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/5000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000
./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000

which could then be piped to parallel like so:

grep -E '\/[1][0]{4}$|\/[5][0]{3,4}$' test.mcool | tr '\n' '\t' | parallel --dry-run --colsep '\t' --link predictSV --hic-5k {1} --hic-10k {2} --hic-50k {3} [...]

See provided output of bash script above, where in addition the output option for predictSV is accounted for, and the command is setup to easily work with your provided bash script.

Explanation of `parallel` options

--dry-run: Prints out only the command as it would have been executed (removing this option will then actually run the command(s in parallel))
--link: Tell parallel to break apart, delimit (which is specified by the --colsep option; I had used tr to convert newlines into tabs), and partition as indicated (using integer numerals included in curly brackets) the piped input into the command pattern template to be executed in parallel.

(Extra note: A further improvement would be to additionally loop through all of the ./*.mcool files also in parallel (i.e., running a nested parallel single command). But that would only be of interest if you were concerned with improving the overall computation time of your analysis via a more fully realized parallelization of the predictSV commands.)

Would you mind incorporating the additional question under "Edit" into your answer? Thanks! — melolili, Jul 22 '23 at 03:43
How do I edit the script so that the output is like `predictSV --hic-5k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/500 --hic-10k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/10000 --hic-50k ./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/50000 -g hg38 -O input/A001C007 --balance-type CNV --output-format full --prob-cutoff-5k 0.8 --prob-cutoff-10k 0.8 --prob-cutoff-50k 0.99999`? Specifically, I want to remove the parenthesis '' from the output — melolili, Jul 22 '23 at 08:11
Thanks for the effort. The only thing now is that I want to specify the path as `./../input/something/something` instead of `input/something/something`. Your code is producing the latter. — melolili, Jul 22 '23 at 11:56
@JohnCollins, btw, this code is assuming that the inputs will always be in the given order (so `500` will come before `10000`, etc). Does `cooler ls` offer a firm guarantee that that assumption will always be true? — Charles Duffy, Jul 22 '23 at 12:51
It might also be worth running the answer given here through http://shellcheck.net/; it's missing a bit of quoting. Otherwise, though, looks pretty good at this point. — Charles Duffy, Jul 22 '23 at 12:55
`echo "$mcool_file"`, `ls "$mcool_file"`, etc as opposed to `echo $mcool_file`, `ls $mcool_file`, etc. See [I just assigned a variable, but `echo $variable` shows something else](https://stackoverflow.com/questions/29378566/i-just-assigned-a-variable-but-echo-variable-shows-something-else). In this particular case it would manifest if `IFS` were set to contain a character (like `:` or `/`) that exists inside the variables we're handling; there are also side effects related to globbing (which can manifest, or not, depending not just on the variable's contents but also the contents of the[...] — Charles Duffy, Jul 22 '23 at 14:40
[...]working directory, and shell runtime configuration like `nullglob`, `failglob`, &c. — Charles Duffy, Jul 22 '23 at 14:41
(See the shellcheck warning https://www.shellcheck.net/wiki/SC2086) — Charles Duffy, Jul 22 '23 at 14:42
To be clear, I still think you're being a bit incautious here -- assuming that the resolutions the OP asks for will always actually exist, for example, and having no error handling for one of them's missing. A lot of the code that makes my answer longer than yours is there for the sake of making it more robust when pieces don't work as intended; it's not just boilerplate. Similarly, if a future version of `mcool ls` _no longer_ sorts output, my answer will still work -- and one can of course pipe commands it generates into xargs to parallel, vs trying to keep them in the pipeline. — Charles Duffy, Jul 22 '23 at 14:47
Rather, that's one of the major _problems_ with parallel: it brings escaping into its scope and becomes responsible for doing it correctly, instead of following the UNIX philosophy of combining multiple tools each of which keeps its scope small. But it certainly can't fix the code that invokes it breaking the data before that data reaches it; for things where parallel itself isn't doing the substitution you still need to quote. — Charles Duffy, Jul 22 '23 at 15:46
Whereas if you're using xargs correctly there's no _need_ for it to do escaping because it's just adding literal argv elements. (GNU xargs can be used incorrectly but that's a design mistake; the POSIX standard doesn't require xargs to substitute data in the middle of strings at all, and it's better design not to try). — Charles Duffy, Jul 22 '23 at 15:50
@melolili FYI I've added some comments and output to our chat (https://chat.stackoverflow.com/rooms/254625/discussion-between-john-collins-and-melolili) [in case you have lost the link] — John Collins, Jul 22 '23 at 16:51
@JohnCollins I ran into another error: https://stackoverflow.com/questions/76847669/typeerror-read-csv-got-an-unexpected-keyword-argument-error-bad-lines — melolili, Aug 06 '23 at 20:25

Charles Duffy · Answer 2 · 2023-07-22T13:04:16.833

1

Given what you're trying to do, the _num suffix in your variable names is misleading. You don't want to store numbers; you want to store IDs. Thus, instead of setting hic5k_num=$id_suffix, you should store hic5k_num=$id -- or, to use a less-misleading variable name, just hic5k=$id:

#!/usr/bin/env bash
for mcool_file in input/*.mcool; do

  # iterate over ids emitted from cooler ls for this file
  hic5k_num=; hic10k_num=; hic50k_num=
  while IFS= read -r id; do  # id=./../input/A001C007.hg38.nodups.pairs.mcool::/resolutions/200
    id_suffix=${id##*/} # id_suffix=200
    id_part1=${id%%::*} # id_part1=./../input/A001C007.hg38.nodups.pairs.mcool
    id_part2=${id##*::} # id_part2=/resolutions/200

    # id_part1_basename=A001C007.hg38.nodups.pairs.mcool
    id_part1_basename=${id_part1##*/}

    # id_reassembled=A001C007.hg3.nodeps.pairs.mcool::/resolutions/200
    id_reassembled=$id_part1_basename::$id_part2

    case $id_suffix in
      5000)  hic5k=$id_reassembled;;
      10000) hic10k=$id_reassembled;;
      50000) hic50k=$id_reassembled;;
    esac
  done < <(cooler ls "$mcool_file")

  mcool_basename=${mcool_file##*/}
  printf '%q ' predictSV \
   ${hic5k:+  --hic-5k "$hic5k"} \
   ${hic10k:+ --hic-10k "$hic10k"} \
   ${hic50k:+ --hic-50k "$hic50k"} \
   -g hg38 \
   -O "EagleC_output/${mcool_basename%%.*}" \
   --balance-type CNV \
   --output-format full \
   --prob-cutoff-5k 0.8 \
   --prob-cutoff-10k 0.8 \
   --prob-cutoff-50k 0.99999
  printf '\n'
done

edited Jul 22 '23 at 13:04

answered Jul 21 '23 at 21:47

Charles Duffy

280,126
43
390
441

Thanks for the effort, Charles! Just one more thing...I want to change the `-O` from `"${mcool_file%%.*}"` to the following: `EagleC_output/` + substring after the `/` parenthesis in `mcool_file` file. For example, `EagleC_output/A001C007`. – melolili Jul 22 '23 at 02:22
@melolili, see the edit calculating and then using `mcool_basename` for that purpose. – Charles Duffy Jul 22 '23 at 12:47
@melolili, btw, if you wanted to parallelize this, once it emits the commands you want, you can run it with `| xargs -d $'\n' -P "$(nproc)" -n 1 -- bash -c` to run `nprocs` copies of `predictSV` in parallel. – Charles Duffy Jul 22 '23 at 12:48
...I'm sure @JohnCollins could tell you a Parallel one-liner that would do the same. – Charles Duffy Jul 22 '23 at 12:49
`| xargs -d $'\n' -P "$(nprocs)" -n 1 -- sh -c` after the `print`? – melolili Jul 22 '23 at 12:50
No, after the `done`, or even outside the script at all (if you save the script as `build_mcool_commands`, then `build_mcool_commands | xargs ....`). – Charles Duffy Jul 22 '23 at 12:51
(BTW, I did edit my earlier comment a bit; it should be just `nproc` not `nprocs`, and `bash` instead of `sh` since we want to be sure the shell can safely consume anything that `printf %q` can generate) – Charles Duffy Jul 22 '23 at 12:53
@melolili, btw, one thing that's difficult here: Stack Overflow works best when we're teaching, not writing code to spec. It's a lot more straightforward for me to write an answer to teach you the skills you need to write something that meets your needs yourself (as opposed to just writing your code for you) when you're the only person able to actually test that code. – Charles Duffy Jul 22 '23 at 13:04
1

@melolili, anyhow -- I went back and took another close look at your sample data and revised a bit further, but also in doing so added comments so you can see what the code is doing and revise it yourself as appropriate. The intent is not just to hand you something that works, but to show you _how_ and _why_ it works so you can maintain it yourself going forward. – Charles Duffy Jul 22 '23 at 13:04
I would just add that the repo for the OP's software, EagleC, under the `predictSV` command docs on the github page, explicitly states that parallelization on a cluster will be essentially necessary, given how intensive this computation is and the enormous size of the inputs. (As is usually the case in bioinformatics, hence, again, my impetus for recommending `parallel`.) – John Collins Jul 22 '23 at 13:42
Would you mind checking my follow-up question? https://stackoverflow.com/questions/76751758/error-syntax-error-near-unexpected-token-when-iterating-over-multi-resolution-f – melolili Jul 24 '23 at 12:16
@CharlesDuffy I'm not sure why the code throws `FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = 'A001C008.hg38.nodups.pairs.mcool',`. The `.mcool` files are in `./../input/` and I changed the for-loop to `for mcool_file in input/*.mcool` – melolili Jul 28 '23 at 07:00

Iteratively run a software on multi-resolution files

2 Answers2

Modification of your bash script using `parallel`:

EDIT: Modified bash script including dynamic `-O` option based on `grep` parsing `.mcool` input filenames

Details:

To help illustrate and break down how the above bash script works:

Explanation of `parallel` options

Linked

Iteratively run a software on multi-resolution files

2 Answers2

Modification of your bash script using parallel:

EDIT: Modified bash script including dynamic -O option based on grep parsing .mcool input filenames

Details:

To help illustrate and break down how the above bash script works:

Explanation of parallel options

Linked

Modification of your bash script using `parallel`:

EDIT: Modified bash script including dynamic `-O` option based on `grep` parsing `.mcool` input filenames

Explanation of `parallel` options