I have 100+ directories as follows:

bins_copy]$ ls

bin.1/  
bin.112/  
bin.126/  
bin.24/  
bin.38/  

etc. etc.

Each of these directories contains two files named genes.faa and genes.gff, e.g. bin.1/genes.faa

I now want to add a prefix based on the parent directory so each gene file has a unique identifier, e.g. bin.1/bin1_genes.faa and bin.1/bin1_genes.gff.

I've been going down the Google rabbit hole all morning and nothing has worked so far. I tried something like this:

for each in ./bin.*/genes.faa ; mv genes.faa ${bin%-*}_genes.faa $each ; done 

but that (and several versions of it) gives me the following error:

-bash: syntax error near unexpected token `mv'

Since this error is really generic, I haven't figured it out yet and would truly appreciate your help.

Cheers

Poshi
Marlene

2 Answers

1

Try this Shellcheck-clean code:

#! /bin/bash -p

for genespath in bin.*/genes.*; do
    dir=${genespath%/*}
    dirnum=${dir##*.}
    genesfile=${genespath##*/}
    new_genespath="$dir/bin${dirnum}_${genesfile}"
    echo mv -iv -- "$genespath" "$new_genespath"
done
  • It currently just prints the required mv command. Remove the echo when you've confirmed that it will do what you want.
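
To see what each parameter expansion in the loop produces, here is a minimal sketch that runs them on one sample path without touching any files. The value `bin.112/genes.faa` is only an illustrative stand-in for what the glob `bin.*/genes.*` would yield. (As a side note, the loop in the question fails earlier than this: in Bash the list after `for ... in` must be followed by `do`, which is why `mv` is reported as an unexpected token.)

```shell
#!/bin/bash
# Walk through the expansions on one sample path (no files are touched).
genespath='bin.112/genes.faa'   # sample value, as the glob bin.*/genes.* would produce

dir=${genespath%/*}             # remove shortest suffix matching '/*'  -> bin.112
dirnum=${dir##*.}               # remove longest prefix matching '*.'   -> 112
genesfile=${genespath##*/}      # remove longest prefix matching '*/'   -> genes.faa
new_genespath="$dir/bin${dirnum}_${genesfile}"

printf '%s -> %s\n' "$genespath" "$new_genespath"
# prints: bin.112/genes.faa -> bin.112/bin112_genes.faa
```

Because the glob matches `genes.*`, the same loop body handles both genes.faa and genes.gff.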
pjh
  • Hey, thanks so much. When I run this I get the same error as with the version below: rename_files.sh: line 3: syntax error near unexpected token `$'do\r'' 'ename_files.sh: line 3: `for genespath in bin.*/genes.*; do – Marlene Mar 11 '22 at 12:58
  • The reference to `\r` shows that you've got Windows line endings (CRLF) in `rename_files.sh`. Bash can't handle them. You'll need to remove them. See [Are shell scripts sensitive to encoding and line endings?](https://stackoverflow.com/q/39527571/4154375) and [How to convert Windows end of line in Unix end of line (CR/LF to LF)](https://stackoverflow.com/q/3891076/4154375). – pjh Mar 11 '22 at 13:07
  • 1
    Wonderful, that fixed it for me. Thank you! – Marlene Mar 11 '22 at 13:27
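
For anyone hitting the same `$'do\r'` error, one way to strip the carriage returns is sketched below. It assumes a Bash shell with `tr` and `grep` available; the script written into the scratch directory is only a stand-in for the real rename_files.sh. Tools such as `dos2unix`, where installed, do the same job.

```shell
#!/bin/bash
# Demo in a scratch directory: build a script with CRLF endings, then strip them.
demo=$(mktemp -d)
printf 'echo hello\r\necho world\r\n' > "$demo/rename_files.sh"   # stand-in script

# Count the lines that contain a carriage return before the fix.
before=$(grep -c $'\r' "$demo/rename_files.sh" || true)

# Delete every carriage return; write to a temp file, then replace the original.
tr -d '\r' < "$demo/rename_files.sh" > "$demo/rename_files.sh.tmp" &&
    mv -- "$demo/rename_files.sh.tmp" "$demo/rename_files.sh"

# Count again; grep -c prints 0 when nothing matches.
after=$(grep -c $'\r' "$demo/rename_files.sh" || true)

echo "lines with CR before: $before, after: $after"
bash "$demo/rename_files.sh"   # now runs without the syntax error
```

To avoid the problem in the first place, save the script with Unix (LF) line endings in the editor used to create it.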
0

There may be a more elegant way of doing this, but create this script in the same directory as the bin directories, chmod 700 it, and run it. You might want to back up with tar first (tar -cf bin.tar ./bin*).

#!/bin/bash
files="bin.*"
for f in $files; do
        mv ./${f}/genes.faa ./${f}/${f}_genes.faa
        mv ./${f}/genes.gff ./${f}/{$f}_genes.gff
done
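
A hedged variant of the same directory loop, with the `{$f}` typo noted in the comments avoided, the variables quoted, and the `.` dropped from the prefix as the question asks. It is a sketch rather than a drop-in replacement: it builds two sample directories under a scratch location (the `demo` variable and the sample bins are assumptions for illustration) so that nothing real is renamed.

```shell
#!/bin/bash
# Demo in a scratch directory so nothing real is renamed.
demo=$(mktemp -d)
mkdir -p "$demo/bin.1" "$demo/bin.112"
touch "$demo"/bin.{1,112}/genes.{faa,gff}

(
    cd "$demo" || exit 1
    for d in bin.*/; do
        d=${d%/}              # drop the trailing slash from the glob match
        prefix=${d/./}        # bin.112 -> bin112 (removes the first '.')
        for ext in faa gff; do
            # Skip quietly if a directory lacks one of the two files.
            [ -e "$d/genes.$ext" ] || continue
            mv -- "$d/genes.$ext" "$d/${prefix}_genes.$ext"
        done
    done
    ls bin.*/
)
```

To use it on real data, replace the scratch setup with a `cd` into the directory that holds the bin directories, ideally after taking the tar backup mentioned above.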
rabbit
  • 1
    There is a typo in your second `mv` command (`{$f}`). And the question mentions that he wants to remove the `.` between `bin` and the digits. – Nic3500 Mar 11 '22 at 02:59
  • Hey, thank you so much! I fixed the typo and ran the script in the directory with all the subdirectories, but it gives me a similar syntax error as before: rename_files.sh: line 3: syntax error near unexpected token `$'do\r'' 'ename_files.sh: line 3: `for f in $files; do – Marlene Mar 11 '22 at 12:55
  • Is there a way to add the same thing to the fasta header, e.g. >bin.100_ProteinName xyz? I have tried to do it in a similar fashion like posted above and that didn't work. I also tried for each in *.faa ; do sed -i "s/>/>${file%%.*}_/" $each; done which didn't work either. Does anybody know how to do that? Or can point me in the right direction? – Marlene Mar 11 '22 at 19:29
  • Can you give an example of what ls -l returns in a directory that has the "Protein Name" files you want renamed, and an example of what you want them renamed to? – rabbit Mar 12 '22 at 10:52
  • Sure! When I run ls -l it looks like this: -rw-r--r-- 1 mjensen2 ncsu 365580 Mar 9 11:38 BC-1_bin.109_genes.faa -rw-r--r-- 1 mjensen2 ncsu 209980 Feb 28 12:16 BC-1_bin.109_genes.gff -rw-r--r-- 1 mjensen2 ncsu 535387 Feb 28 12:16 hmmer.analyze.txt -rw-r--r-- 1 mjensen2 ncsu 2321 Feb 28 12:16 hmmer.tree.txt. The headers of the files currently look like this: >BC-1_k127_1964613_1 # 1 # 525 # -1 # ID=1_1;partial=10;start_type=ATG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.480 and I want them to start like this: >>BC-1_bin.109_k127_1964613_1. – Marlene Mar 14 '22 at 12:03
  • By header do you mean the first line contained in all the files in the directory? – rabbit Mar 14 '22 at 13:02
  • No, all the lines that start with ">". In fasta format you usually have the "header" that contains a name and DNA/protein description. E.g. >NODE_794235_length_253_cov_7.785714 GGCGTTACAGCGCTTGGCAATA >NODE_794238_length_253_cov_7.658730 GCTCTGCAAAGGTTCATCGAATCCGATACCAGGGATTGACTAAAACCTAG CGGGGCTTT CCTCAATAGGGCAGGATTTACAGGAATACGCGGGATTTACAGGATTCTTG TCGTTGCTGAATTTCCTACGCCTTAGGAACGGTTCACTAACACGTTATCC ATCGTCTTAGGTCTGGTGGCGCCCCTTGTCGAACCTCACACCAAGACACG AAG >NODE_794259_length_253_cov_6.357143 TCTGCGTCACACGACCTGAGCGCGGTAGTCGTCATCCCAAGCGGCTAGGC GTCGTTTCTGTGCCAGACGCCGGGAGGAGGACTCGCGTTCAACAAA – Marlene Mar 14 '22 at 14:06
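
On the FASTA header question: the `sed` attempt quoted above loops over `each` but expands `${file%%.*}`, which is unset, so the prefix comes out empty. A minimal sketch of one way to prefix every header line with the parent directory name (the `>bin.100_ProteinName xyz` form from the earlier comment) is below. The sample file, its contents, and the `demo` directory are assumptions for illustration, and the later `BC-1_bin.109_...` layout is not covered.

```shell
#!/bin/bash
# Demo in a scratch directory with one small, made-up FASTA file.
demo=$(mktemp -d)
mkdir -p "$demo/bin.100"
printf '>ProteinName xyz\nMKV\n>Other abc\nMLL\n' > "$demo/bin.100/genes.faa"

for faa in "$demo"/bin.*/*.faa; do
    dir=${faa%/*}      # path of the parent directory
    bin=${dir##*/}     # its last component, e.g. bin.100
    # Only lines starting with '>' are headers; sequence lines are left alone.
    # The name is pasted into the sed replacement as-is, so this assumes
    # directory names without '/', '&' or backslashes.
    sed "s/^>/>${bin}_/" "$faa" > "$faa.tmp" && mv -- "$faa.tmp" "$faa"
done

cat "$demo/bin.100/genes.faa"
```

Writing to a temporary file and moving it into place sidesteps the differences between GNU and BSD `sed -i`. Running the loop twice would add the prefix twice, so it is worth trying it on a copy first.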