I am trying to find XML files containing a particular string. These files are however zipped as .gz. Essentially, I want to search through all of these gz files in the directory without extracting them. Additionally, I would like to get the specific filename which matches the search pattern and not the output itself.

I have managed to get the matching output itself with the following piped grep command:

gunzip -c *.xml.gz | grep 'idName="M"'

I would like to get the filenames however. I read somewhere that the -l flag for grep will return the matching filename, but in this case, it gives me a result saying (standard input). I assume this is because I need to be piping the filename from gunzip too, but how do I do that?

Edit: I have also had partial success by doing

gunzip -vc *.xml.gz | grep 'idName="M"'

but this gives me output like

filename_X:    30% -- replaced with stdout
filename_Y:    50% -- replaced with stdout
filename_Z:    complete matching output

I would like to suppress the matching output too in this case, and not show all the non-matching filenames.

  • Since `grep` reads from standard input, it does not see any filenames at all, so it cannot report them. – user1934428 May 11 '23 at 11:03
  • Actually, GNU `grep` provides the `--label` option to allow you to manually override this. But you need to pass one file at a time for this to work. – tripleee May 11 '23 at 11:18
  • How can `gunzip -vc *.xml.gz | grep 'idName="M"'` produce output like `filename_X: 30% -- replaced with stdout` that doesn't contain `idName="M"`? – Ed Morton May 11 '23 at 12:21
  • Are you trying to output the name of the zip file or the name of one of the files contained in the zip file? – Ed Morton May 11 '23 at 12:24
  • @EdMorton - I'm not sure why this command produces the output. I can only see that it does. Either one of the filenames would work - either the zip file name or the file contained within. Each file is basically zipped into its own gz with the same name, just with .gz as extension. – Anup Sebastian May 11 '23 at 13:06
  • I suspect what you're seeing is the `stderr` output from `gunzip`, nothing to do with the `stdout` you want from `grep` and that `grep` is actually not producing any of the output you show. – Ed Morton May 11 '23 at 13:10
  • If "Each file is basically zipped into its own gz with the same name" then how can you get output of `filename_X` (no `.xml` extension) from a file whose name ends in `*.xml.gz`? Please make sure the input and output you provide are consistent with each other to make your question as clear as possible. – Ed Morton May 11 '23 at 13:13

1 Answer

The zgrep family of tools exists exactly for this use case.

zgrep -l 'idName="M"' *.xml.gz

If you need the same for *.zip files, look for zipgrep.

If the pattern you are searching for is just a static string, not a regular expression, you can speed up processing with the -F flag (the equivalent of the legacy fgrep command). This can make a substantial difference if the files are big.
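
As a minimal sketch of the fixed-string variant, assuming the zgrep shipped with GNU gzip (which passes -l and -F through to grep); the sample files and the temporary directory are hypothetical and exist only to make the example runnable:

```shell
# Hypothetical sample data: two gzipped XML files, only one contains the string.
tmpdir=$(mktemp -d)
printf '<item idName="M"/>\n' | gzip -c >"$tmpdir/match.xml.gz"
printf '<item idName="X"/>\n' | gzip -c >"$tmpdir/other.xml.gz"

# -l prints only the names of files with a match;
# -F treats the pattern as a literal string rather than a regular expression.
matches=$(cd "$tmpdir" && zgrep -l -F 'idName="M"' *.xml.gz)
printf '%s\n' "$matches"

rm -rf "$tmpdir"
```

Only the name of the compressed file containing the string should be printed, with no matching lines.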


If you need this for a file type for which you can't find an existing tool which provides this functionality, a crude implementation looks something like

regex=$1
shift
for file; do
    gzip -dc <"$file" |
    sed -n "/$regex/s|^|$file:|p"
done

... with various complications to handle different options, etc.; and with the caveat that this simple sed script has robustness issues in a number of corner cases (the regex can't contain a slash, and the file name can't contain a literal |, & or backslash, or a newline).

If you have GNU grep, try something like

regex=$1
options=$(... complex logic to extract grep options ...)
shift
for file; do
    gzip -dc <"$file" |
    grep --label="$file" -H -e "$regex" $options
done
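
For printing only file names, as the question asks, the same loop can use -l in place of -H. This is a minimal sketch, assuming GNU grep (--label is a GNU extension); the sample files and temporary directory are hypothetical:

```shell
# Hypothetical sample data: two gzipped XML files, only one contains the string.
tmpdir=$(mktemp -d)
printf '<item idName="M"/>\n' | gzip -c >"$tmpdir/match.xml.gz"
printf '<item idName="X"/>\n' | gzip -c >"$tmpdir/other.xml.gz"

labelled=$(
    for file in "$tmpdir"/*.xml.gz; do
        # --label replaces "(standard input)" with the real file name;
        # -l prints that name once instead of the matching lines.
        # grep exits non-zero for files without a match, which is expected here.
        gzip -dc <"$file" | grep --label="$file" -l -e 'idName="M"' || true
    done
)
printf '%s\n' "$labelled"

rm -rf "$tmpdir"
```

Because grep sees one file per invocation, each label corresponds to exactly one compressed file.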

In your particular case, this can be reduced to just

regex=$1
shift
for file; do
    gzip -dc <"$file" |
    grep -q -e "$regex" &&
    echo "$file"
done

without any GNUisms.

Obviously, you'd replace gzip -dc with whatever you need to extract the information from the file type you want to process.
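
One way to make the extraction command swappable is to pass it as a parameter. The following is only a sketch: the function name files_matching and the sample files are hypothetical, and the demo uses gzip -dc, though a command such as xz -dc or bzip2 -dc could be passed instead if that tool is installed.

```shell
# Hypothetical helper: print each file whose decompressed contents match a regex.
# Usage: files_matching 'DECOMPRESS_CMD' REGEX FILE...
files_matching() {
    decompress=$1
    regex=$2
    shift 2
    for file; do
        # $decompress is deliberately unquoted so that "gzip -dc" splits into
        # a command and its option; this assumes it contains no glob characters.
        $decompress <"$file" | grep -q -e "$regex" && echo "$file"
    done
    return 0
}

# Hypothetical sample data: two gzipped XML files, only one contains the string.
tmpdir=$(mktemp -d)
printf '<item idName="M"/>\n' | gzip -c >"$tmpdir/match.xml.gz"
printf '<item idName="X"/>\n' | gzip -c >"$tmpdir/other.xml.gz"

found=$(files_matching 'gzip -dc' 'idName="M"' "$tmpdir"/*.xml.gz)
printf '%s\n' "$found"

rm -rf "$tmpdir"
```

Passing the command as a single string keeps the call site short, at the cost of relying on word splitting; a stricter version might accept the command as separate arguments.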

tripleee
  • I imagine you are already aware of the caveats around attempting to parse XML with regular expressions. If not, see also https://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – tripleee May 11 '23 at 11:15