I have a bunch of uploaded .root files on my laptop, but I need just specific ones

Question

I have a directory with 10000 .root files (named like hists11524_blinded.root or hists9899_blinded.root) and need to run some macros on them for my data analysis. However, I only need 4000 of these files in the directory. The run numbers I need (these 4000 numbers) are listed in the file thebest.txt, which is in the same directory as the histograms.

I want to delete the files which are not needed for the processing before running macros by using the info from a .txt file.

This is what the thebest.txt file looks like:

My guess is to work with the command:

-comm -2 -3 <(ls) <(sort thebest) | tail +2 | xargs -p rm

I get 2 errors:

tail: invalid option -- 'p'

sort: cannot read: No such file or directory

The file thebest.txt contains only numbers with 5 digits like 09999 or 11256, the directory contains files with names like hists9999_blinded.root or hists11256_blinded.root.

The number of digits differs between the two lists; that is the main issue.

How exactly are the files named? Please show some examples. What do you want to achieve? As you are using a `rm` command I assume you want to delete some of the files. Which files based on the numbers in `thebest.txt` and the existing files should be deleted? Please [edit] your question and add the missing information or clarification. If your file is named `thebest.txt`, you should specify exactly this name as `sort thebest.txt` instead of `thebest` without `.txt`. You should test your command step by step, e.g. `comm -2 -3 <(ls) <(sort thebest.txt)` first, then add `| tail +2` etc. — Bodo, Jul 02 '19 at 13:40
The files are named ``` hists1000_blinded.root ```. I have 10000 of them in a directory, but I need to run a macro for just 4000 of those. The numbers of those 4000 histograms are listed in the thebest.txt file. I want to run a command which will delete the other 6000 that I don't need from the directory. It's true that I forgot to put .txt in the command, but it still doesn't work. — sonic, Jul 02 '19 at 13:51
**Please [edit] your question and add this information instead of answering in a comment.** Can we assume that all numbers in `thebest.txt` have 5 digits (with leading 0 if necessary)? A file named `hists1000_blinded.root` doesn't match the 5-digit numbers. Would `thebest.txt` contain `1000` or `01000` to match this file? Or is the file in reality named `hists01000_blinded.root`? Formatting hint: Use a single backquote instead of 3 to get a code snippet inline. — Bodo, Jul 02 '19 at 14:14
I edited it as much as I could. All the numbers in `thebest.txt` have 5 digits (they don't start from 1, they start from 09769). The directory contains `.root` files which are named `hists9769_blinded.root` or `hists11526_blinded.root`. The problem is that the number of digits doesn't match. — sonic, Jul 02 '19 at 14:37
@Bodo I'm sorry for giving an example with 1000, I was in a rush — sonic, Jul 02 '19 at 14:39
Again: Please add all clarification you wrote in your comments **to the question**. The main problem was not the example with 1000 by itself, but the missing clarification about the numbers. Can there also be numbers with less than 4 digits in the file names? (For example when you start from the beginning.) — Bodo, Jul 02 '19 at 14:46
no, 4 is minimum, 5 is max. In the .txt file all the numbers have fixed 5 digits. — sonic, Jul 02 '19 at 14:57

score 0 · Accepted Answer · answered Jul 02 '19 at 15:23

One option is to remove the leading 0s from the numbers to match the file names. To avoid matching substrings, you can prepend and append the corresponding file name parts (in your case, the number is in the middle of the file name).

As it is not clear whether the leading spaces in the sample file thebest.txt are intentional or only a formatting issue, leading spaces are removed as well.

As deleting the wrong files may lead to data loss, you may also consider processing only the matching files instead of deleting the non-matching ones.

# remove surrounding whitespace and leading zeros, then prepend/append file name parts
sed 's/^[[:space:]]*0*\([1-9][0-9]*\)[[:space:]]*$/hists\1_blinded.root/' thebest.txt > thebestfiles.txt

# get matching files and process (process_matching is a placeholder for your own command)
find . -name 'hists*_blinded.root' | grep -F -f thebestfiles.txt | xargs process_matching

# or get non-matching files and remove (grep -F replaces the deprecated fgrep)
find . -name 'hists*_blinded.root' | grep -F -v -f thebestfiles.txt | xargs rm

The find command searches the current directory recursively. To exclude subdirectories, you can use -maxdepth 1. To avoid processing directory names, you might also add -type f.
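Since a wrong pattern file could remove the wrong histograms, it may help to preview the deletion list first. The following is only a sketch under assumptions: it runs in a throwaway directory created with mktemp, the three sample file names and two run numbers are invented for the demo, and to_delete.txt is a hypothetical name for the preview list. It combines the sed step with the -maxdepth 1 and -type f options mentioned above.

```shell
#!/bin/sh
# Sketch only: the demo directory and sample data below are made up.
set -eu

demo=$(mktemp -d)
cd "$demo"

# Sample data: three histogram files, two wanted run numbers (5 digits, zero-padded).
touch hists9769_blinded.root hists9899_blinded.root hists11526_blinded.root
printf '09769\n11526\n' > thebest.txt

# Turn each run number into the file name it should match.
sed 's/^[[:space:]]*0*\([1-9][0-9]*\)[[:space:]]*$/hists\1_blinded.root/' \
  thebest.txt > thebestfiles.txt

# Dry run: list regular files in this directory only that are NOT in the wanted list.
find . -maxdepth 1 -type f -name 'hists*_blinded.root' \
  | grep -F -v -f thebestfiles.txt | sort > to_delete.txt
cat to_delete.txt

# After reviewing to_delete.txt, the removal step would be:
# xargs rm < to_delete.txt
```

Keeping the rm step commented out until the list has been checked means nothing is deleted by accident; in the real directory, only the sed and find lines would be needed.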

@KristinaMikhailova If it works you can accept the answer. — Bodo, Jul 09 '19 at 06:51
