I have two directories with files that end in two different extensions:

Folder A called profile (1204 FILES)

file.fasta.profile
file1.fasta.profile
file2.fasta.profile


Folder B called dssp (1348 FILES)


file.dssp
file1.dssp
file2.dssp
file3.dssp #<-- odd one out

Some files in folder B have no counterpart in folder A and should be removed. For example, file3.dssp would be deleted because there is no file3.fasta.profile in folder A. I want to keep only the files whose names match once the extension is excluded, so that both folders end up with 1204 files. I saw some bash one-liners using diff, but they do not cover this case, where the files to remove are those with no counterpart in the other folder.

Francesca C
  • use glob with https://stackoverflow.com/questions/678236/how-do-i-get-the-filename-without-the-extension-from-a-path-in-python – bmjeon5957 May 10 '22 at 00:49

4 Answers

Here is a way to do it:

  • for both A and B directories, list the files under each directory, without the extension.
  • compare both lists and keep only the names that appear in A but not in B.

Code:

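#!/bin/bash

# Start with empty list files
>a.list
>b.list

# Record each name in A up to its first dot, so file.fasta.profile becomes file
for file in A/*
do
    name=$(basename "$file")
    echo "${name%%.*}" >>a.list
done

# Same for B, so file.dssp becomes file
for file in B/*
do
    name=$(basename "$file")
    echo "${name%%.*}" >>b.list
done

# Names that appear only in a.list
comm -23 <(sort a.list) <(sort b.list) >delete.list

# Remove the matching files from A, whatever their extension(s)
while IFS= read -r line; do
    rm -v -- A/"$line".*
done < "delete.list"

# cleanup
rm -f a.list b.list delete.list
  • basename removes the path
  • "${name%%.*}" removes everything from the first dot, so multi-part extensions such as .fasta.profile are stripped too ("${name%.*}" would only strip the last one)
  • comm -23 ... shows only the lines that appear only in a.list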
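
A caveat for names like file.fasta.profile: stripping only the last extension leaves file.fasta, which never matches the file that comes from file.dssp. Here is a small Python sketch of the difference. The helper name stem_before_first_dot is made up for illustration, and it assumes the part of the name you compare on contains no dot itself.

```python
import os

def stem_before_first_dot(filename):
    """Return the part of a file name before its first dot (hypothetical helper)."""
    return os.path.basename(filename).split(".", 1)[0]

# Stripping only the last extension keeps the inner ".fasta":
print(os.path.splitext("file.fasta.profile")[0])      # file.fasta

# Cutting at the first dot gives a key shared by both directories:
print(stem_before_first_dot("A/file.fasta.profile"))  # file
print(stem_before_first_dot("B/file.dssp"))           # file
```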

EDIT May 10th: my initial code listed the file, but did not delete it. Now it does.

Nic3500
  • I like that this shows the list of files that are to be removed, the last line is for the removal yes? – Francesca C May 10 '22 at 09:31
  • I just followed it through, it seemed to create the two a.list and b.list files but it created it with the actual name written: `file%.*`. it did this for both. – Francesca C May 10 '22 at 09:41
  • Do you want a Python or a bash solution? You mention Jupyter in comments to one of the answers. That runs Python great, but I am not sure about all the subtleties of bash there (e.g. variable expansion like `${file%.*}`). I tested it directly in bash on a Linux box. – Nic3500 May 10 '22 at 12:57
  • I could run python or bash from Jupyter the same, well at least it seems that way to me – Francesca C May 10 '22 at 12:59
  • Can I make this code run in Jupyter though? I ran it through the shell cause I didn't think it would work – Francesca C May 10 '22 at 13:00
  • I do not have a Jupyter setup to test it real quick, so YMMV. It works for me on Linux Mint 20, GNU bash 4.4.20. – Nic3500 May 10 '22 at 13:01
  • okay I will run it on the shell then...so I create 2 files, a and b, add the list without the file extension to each corresponding list and then find those in common yes? – Francesca C May 10 '22 at 13:04
  • do I need to write the entire path even if I am in the directory? – Francesca C May 10 '22 at 13:06
  • I am doing `for file in dssp/*` for example, should there be a .*? – Francesca C May 10 '22 at 13:06
  • `for file in dssp/*` will list all files under sub-directory `dssp`. No need to put `*.*`. `*` matches all, extension included. – Nic3500 May 10 '22 at 13:08
  • it seems to be working! okay how do I see the final outputs then? or how can I verify it was successful? – Francesca C May 10 '22 at 13:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/244636/discussion-between-francesca-c-and-nic3500). – Francesca C May 10 '22 at 13:15
  • `/bin/ls -c1 A` and `/bin/ls -c1 B` should be the same. – Nic3500 May 10 '22 at 13:46

Try this Shellcheck-clean Bash program:

#! /bin/bash -p

folder_a=PATH_TO_FOLDER_A
folder_b=PATH_TO_FOLDER_B

shopt -s nullglob
for ppath in "$folder_a"/*.fasta.profile; do
    pfile=${ppath##*/}
    dfile=${pfile%.fasta.profile}.dssp
    dpath=$folder_b/$dfile
    [[ -f $dpath ]] || echo rm -v -- "$ppath"
done
  • It currently just prints what it would do. Remove the echo once you are sure that it will do what you want.
  • shopt -s nullglob makes globs expand to nothing when nothing matches (otherwise they expand to the glob pattern itself, which is almost never useful in programs).
  • See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for information about the string manipulation mechanisms used (e.g. ${ppath##*/}).
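
For anyone running this from Python (e.g. in Jupyter), here is a rough equivalent of the loop above. It is only a sketch: the function name profiles_without_dssp and the default suffix are my own choices, not something from the answer. It never deletes anything, it just returns the candidates.

```python
import glob
import os

def profiles_without_dssp(folder_a, folder_b, suffix=".fasta.profile"):
    """List files in folder_a ending in `suffix` that have no <name>.dssp in folder_b."""
    missing = []
    # glob.glob returns an empty list when nothing matches, much like nullglob
    pattern = os.path.join(glob.escape(folder_a), "*" + suffix)
    for ppath in sorted(glob.glob(pattern)):
        # Swap the known suffix for .dssp, like ${pfile%suffix}.dssp in bash
        name = os.path.basename(ppath)[: -len(suffix)]
        if not os.path.isfile(os.path.join(folder_b, name + ".dssp")):
            missing.append(ppath)
    return missing

# Example call (folder names are placeholders):
# for path in profiles_without_dssp("PATH_TO_FOLDER_A", "PATH_TO_FOLDER_B"):
#     print("would remove", path)
```

Review the printed list first, then call os.remove on each entry once you are happy with it.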
pjh

With find:

find 'folder A' -type f -name '*.fasta.profile' -exec sh -c \
'! [ -f "folder B/$(basename -s .fasta.profile "$1").dssp" ]' _ {} \; -print

Replace -print with -delete once you are convinced that it does what you want.

Or, maybe a bit faster:

find 'folder A' -type f -name '*.fasta.profile' -exec sh -c \
'for f in "$@"; do [ -f "folder B/$(basename -s .fasta.profile "$f").dssp" ] || echo rm "$f"; done' _ {} +

Remove echo once you are convinced that it does what you want.
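
If find is not convenient (for example inside a notebook), the same recursive walk can be sketched with pathlib. This is an illustration under assumptions, not a drop-in replacement: prune_profiles and its dry_run switch are names I made up, and dry_run=True plays the role of -print / echo above.

```python
from pathlib import Path

def prune_profiles(folder_a, folder_b, suffix=".fasta.profile", dry_run=True):
    """Walk folder_a recursively and pick files ending in `suffix` with no <name>.dssp in folder_b.

    With dry_run=True (the default) nothing is deleted; the candidates are only returned.
    """
    removed = []
    # sorted() materialises the listing before anything is deleted
    for path in sorted(Path(folder_a).rglob("*" + suffix)):
        if not path.is_file():
            continue
        name = path.name[: -len(suffix)]
        if not (Path(folder_b) / (name + ".dssp")).is_file():
            removed.append(path)
            if not dry_run:
                path.unlink()
    return removed
```

Call it once with the default dry_run=True, check the returned paths, then call it again with dry_run=False.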

Renaud Pacalet
  • it does not seem to work, I realised that it's .fasta.profile, not just .profile, I added the .fasta portion in, but it does not change anything. – Francesca C May 10 '22 at 09:31
  • Edited accordingly but please edit your question and provide an accurate description of your problem. Remember that SO is not just a Q&A site. It is also a knowledge database. It is important that questions and answers can be useful to others. – Renaud Pacalet May 10 '22 at 10:07
  • I tried it, btw I am running with Jupyter. it does not print anything, but does not create any error either. It seems to be working on something however as it looks as though it is loading something, but the output is empty. – Francesca C May 10 '22 at 12:52
  • just to show: the files are titled like such: d1ayoa_.dssp, d1a62a1.dssp --- d1ayoa_.fasta.profile, d1a62a1.fasta.profile – Francesca C May 10 '22 at 12:58
  • From which directory do you launch these `find` commands? It must be from the parent directory of your two folders. By the way, did you check the folder names? Is it really `folder A` and `folder B` with a space between `folder` and `A/B`? If not you must adapt the `find` commands. – Renaud Pacalet May 10 '22 at 15:05
  • And **please edit your question**. If you don't it could be downvoted and closed before you will receive a working answer. – Renaud Pacalet May 10 '22 at 15:07

Python version:

EDIT: now supports multiple extensions

#!/usr/bin/python3

import glob
import os

def removeext(filename):
    # Keep everything before the first dot: file.fasta.profile -> file
    return filename.split(".", 1)[0]

setA = set(map(removeext, os.listdir('A')))
print("Files in directory A: " + str(setA))

setB = set(map(removeext, os.listdir('B')))
print("Files in directory B: " + str(setB))

# Names present in A but missing from B
setDiff = setA.difference(setB)
print("Files only in directory A: " + str(setDiff))

for filename in setDiff:
    # glob.escape protects names that contain glob characters such as [ or ?
    file_path = os.path.join("A", glob.escape(filename) + ".*")
    for file in glob.glob(file_path):
        print("file=" + file)
        os.remove(file)

Does pretty much the same as my bash version above.

  • list files in A
  • list files in B
  • get the list of differences
  • delete the differences from A

Test output, done on Linux Mint, bash 4.4.20

mint:~/SO$ l
drwxr-xr-x 2 Nic3500 Nic3500 4096 May 10 10:36 A/
drwxr-xr-x 2 Nic3500 Nic3500 4096 May 10 10:36 B/

mint:~/SO$ l A
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file1.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:14 file3.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:36 file4.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file.fasta.profile
mint:~/SO$ l B
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:05 file1.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file3.dssp
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:05 file.dssp


mint:~/SO$ ./so.py
Files in directory A: {'file1', 'file', 'file3', 'file2', 'file4'}
Files in directory B: {'file1', 'file', 'file3', 'file2'}
Files only in directory A: {'file4'}
file=A/file4.fasta.profile


mint:~/SO$ l A
total 0
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file1.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file2.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:14 file3.fasta.profile
-rw-r--r-- 1 Nic3500 Nic3500 0 May 10 10:06 file.fasta.profile
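
To verify the result afterwards, or to see the leftovers in either direction before deleting anything, a read-only check along these lines may help. It is a sketch: compare_stems is a name I picked, it compares on the text before the first dot, and it skips hidden files (names starting with a dot), which is my assumption about what you want.

```python
import os

def compare_stems(dir_a, dir_b):
    """Return (only_in_a, only_in_b): stems (text before the first dot) unique to each directory."""
    def stems(directory):
        # Hidden entries are skipped; a set keeps each stem only once
        return {name.split(".", 1)[0]
                for name in os.listdir(directory)
                if not name.startswith(".")}
    stems_a = stems(dir_a)
    stems_b = stems(dir_b)
    return sorted(stems_a - stems_b), sorted(stems_b - stems_a)

# Example call (directory names as in the test above):
# only_a, only_b = compare_stems("A", "B")
# print("only in A:", only_a)
# print("only in B:", only_b)
```

Both lists are empty when the two directories match. Note that a set collapses names sharing the same stem, so its length can be smaller than the number of files in the directory.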
Nic3500
  • I think there must be a problem with the fact that it is .fasta.profile and not just .fasta or .profile as here I get: `Files in directory A: {'d1y0pa3.fasta', 'd3nvsa_.fasta', 'd1k8kd1.fasta', 'd1j09a1.fasta', 'd1vh4a_.fasta', 'd2oy9a1.fasta', 'd3ag3h_.fasta', 'd1wtja_.fasta', 'd3mzfa2.fasta', 'd1g2ra_.fasta', 'd1knza_.fasta', 'd2fcwa1.fasta', 'd1a62a1.fasta', 'd2fgqx_.fasta', 'd2idob_.fasta', 'd1mwqa_.fasta', 'd1k8ke_.fasta', 'd1oisa_.fasta', 'd3essa_.fasta', 'd1fn9a_.fasta', 'd2vn6b_.fasta'` but it is not deleting anything. – Francesca C May 10 '22 at 14:02
  • Fixed for multiple extensions. – Nic3500 May 10 '22 at 14:21
  • I simply get an output of: `Files in directory A: set() Files in directory B: set() Files only in directory A: set()` hmph. – Francesca C May 10 '22 at 14:34
  • Tested on python 3.6 – Nic3500 May 10 '22 at 14:49
  • finally! I think it worked...but... :1204 1205 is the results... there is one left out! let me check. – Francesca C May 10 '22 at 15:02
  • how can I print the files that are different only considering the filename? – Francesca C May 10 '22 at 15:06
  • That is what you get from the "Files only in directory A" output. – Nic3500 May 10 '22 at 15:09
  • When I use the line: `setDiff = setB.difference(setA) print("Files only in directory B: " + str(setDiff))` I get `Files only in directory B: set()` and I did try vice-versa just in case... it is strange no? as clearly there is a difference. – Francesca C May 10 '22 at 15:12
  • FYI for your next question, include more details. Ex. running in Jupyter, tag the language you prefer or indicate that you accept both, complete data set for our tests. – Nic3500 May 10 '22 at 15:12
  • All files in B are included in A, so the difference is empty, hence the empty set `set()`. – Nic3500 May 10 '22 at 15:13
  • the number of files in dssp is 1205, that is what I put for set A which originally was 1348. while the profile folder or folder B, is what I set for set B. I think it is important to mention that the order in which you put this in matters because when I flipped the two sets, it does not work. the larger set must be in set A – Francesca C May 10 '22 at 15:23
  • I ran it again and I am still getting one extra in set A, I am running this code: `list(set(setA) - set(setB))` and it results empty, why is that? – Francesca C May 10 '22 at 15:24
  • I am running this: `!ls /Users/user/Desktop/Desktop/LAB2/prova_training/profile | wc -l !ls /Users/user/Desktop/Desktop/LAB2/prova_training/dssp | wc -l` to check the length, maybe this is the problem? as running len(setA) and len(setB) I get 1201 for both. it should be 1204 though. – Francesca C May 10 '22 at 15:34
  • If you run the script once, it deleted files, so you can't run it over and over again on the same source directories after they have been cleanup once. As to your wc, impossible to know since I do not have access to your system. – Nic3500 May 10 '22 at 15:39
  • No, I am running on a new directory each time that I have duplicated... I still get 1201 for Len and 1204 for wc..very strange. – Francesca C May 10 '22 at 15:42
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/244643/discussion-between-francesca-c-and-nic3500). – Francesca C May 10 '22 at 16:12