Search multiple strings in one line using regex in nested files/directories and output matched results

Question

For example, if there are files and directories:

/tmp/temp_dir/subdir_001/file_001.txt
/tmp/temp_dir/subdir_001/file_002.txt
/tmp/temp_dir/subdir_002/file_003.txt
/tmp/temp_dir/subdir_003/file_004.txt

And those have various contents with specific lines that can be found by regex. For example, here is the content of the file file_001.txt:

abc cba
little boy writes -54321_12345 and goes to street 987
bca acb
little boy writes -12345_54321 and jumps to street 789
cab bac

What I'm interested in are the lines that start with little boy writes. I'm using this regex pattern to capture the data that I would like to save as output: little boy writes (\-\d+\_\d+).*street (\d+)

How can I search recursively and output only the matched strings? The output file should contain only this:

54321_12345 987
12345_54321 789

Think about using find and exec. – Raman Sailopal, Jul 30 '20 at 09:26

Aserre · Accepted Answer · 2020-07-31T13:47:49.603

A combination of find and sed should do the trick :

find /tmp/temp_dir/ -type f -exec sed -En 's/little boy writes -([0-9]+_[0-9]+).*street ([0-9]+)/\1 \2/p' {} + > output

Breakdown :

find /tmp/temp_dir/ -type f : finds every regular file recursively, starting from the root folder
-exec sed '...' {} + : runs sed on the files found (here {} represents the items retrieved by the find command, and + means find passes as many files as possible to each sed invocation instead of running sed once per file)
sed -En 's/little boy writes -([0-9]+_[0-9]+).*street ([0-9]+)/\1 \2/p' : applies the pattern you described in your question; -n suppresses the default output and the p flag prints only the lines where the substitution happened (\d is not a valid sed character class, so we use [0-9] instead)
> output : redirects the output of this command to a file called output
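
To try the command without touching real data, here is a minimal sketch that rebuilds a small version of the tree from the question in a scratch directory. The variables `tmp` and `out` and the content of file_003.txt are made up for illustration; only the find and sed invocation comes from the answer.

```shell
# Recreate a small version of the tree from the question in a scratch directory.
tmp=$(mktemp -d)
mkdir -p "$tmp/subdir_001" "$tmp/subdir_002"
printf '%s\n' \
  'abc cba' \
  'little boy writes -54321_12345 and goes to street 987' \
  'bca acb' \
  'little boy writes -12345_54321 and jumps to street 789' \
  'cab bac' > "$tmp/subdir_001/file_001.txt"
# Hypothetical second file with no matching lines.
printf '%s\n' 'nothing to see here' > "$tmp/subdir_002/file_003.txt"

# Same find + sed command, pointed at the scratch directory.
# The output file is created outside the searched tree so find does not pick it up.
out=$(mktemp)
find "$tmp" -type f -exec sed -En \
  's/little boy writes -([0-9]+_[0-9]+).*street ([0-9]+)/\1 \2/p' {} + > "$out"

cat "$out"
# 54321_12345 987
# 12345_54321 789
```

Writing the output file outside the directory being searched matters if you rerun the command: otherwise the previous output would itself be scanned.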

Joe · Answer 2 · 2020-07-30T12:49:47.477

You could use grep combined with sed:

$ grep '^little boy writes' /tmp/temp_dir/subdir_*/file_*.txt | sed -re 's/^.* -([0-9]+_[0-9]+).*street ([0-9]+)/\1 \2/' > output.txt

Paul Hodges · Answer 3 · 2020-07-30T14:11:11.747

You could get the lines with just a recursive grep, with or without filenames.

grep -r  '^little boy writes' *  # lists source filenames
grep -hr '^little boy writes' *  # does not

This reports the whole line, though. Perl pattern matching (-P) with -o could probably detect the right lines and return only the bits you want, but the pattern would be hard for most people to understand and maintain, so it's probably worth a second process:

grep -hr '^little boy writes' /tmp/temp_dir/subdir_[0-9][0-9][0-9]/file_[0-9][0-9][0-9].txt |
  sed -E 's/[^0-9_]*([0-9_]+)/\1 /g'

or if you really want to avoid that space at the end,

grep -hr '^little boy writes' /tmp/temp_dir/subdir_[0-9][0-9][0-9]/file_[0-9][0-9][0-9].txt |
  sed -E 's/^[^0-9_]*([0-9_]+)[^0-9_]*([0-9_]+$)/\1 \2/'
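
The difference between the two substitutions is easiest to see on a single line. This is a small sketch that feeds one sample line from the question through each expression; the variable names are only for illustration, and the brackets in the output make the trailing space visible.

```shell
line='little boy writes -54321_12345 and goes to street 987'

# Global version: every run of digits/underscores is kept and followed by a space.
with_space=$(printf '%s\n' "$line" | sed -E 's/[^0-9_]*([0-9_]+)/\1 /g')

# Anchored version: exactly two groups, the second tied to the end of the line.
without_space=$(printf '%s\n' "$line" |
  sed -E 's/^[^0-9_]*([0-9_]+)[^0-9_]*([0-9_]+$)/\1 \2/')

printf '[%s]\n' "$with_space" "$without_space"
# [54321_12345 987 ]
# [54321_12345 987]
```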

But if you know exactly where those files are well enough for globbing like that, all you need is the sed.

sed -En '/^little boy writes/{ s/^[^0-9_]*([0-9_]+)[^0-9_]*([0-9_]+$)/\1 \2/g; p; }' /tmp/temp_dir/subdir_[0-9][0-9][0-9]/file_[0-9][0-9][0-9].txt

If you don't, grep and/or sed may grind through a lot of data that you could avoid...and maybe your directory structure isn't quite that consistent. In that case, shopt will help.

shopt -s globstar # lets ** stand for variable depth of subdirectories
sed -En '/^little boy writes/{ s/^[^0-9_]*([0-9_]+)[^0-9_]*([0-9_]+$)/\1 \2/g; p; }' **/file_[0-9][0-9][0-9].txt

That should be a lot more efficient (and so faster). It lets the shell pick the files that match and hand only those to sed for scanning.

This also uses just one instance of sed, rather than spawning one for each file with find -exec ... \; or needing an xargs.
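
Here is a rough sketch of the globstar approach on a throwaway tree with uneven directory depth. The directory names, `notes.txt`, and the `tmp` and `result` variables are invented for the example. It assumes bash 4 or later is installed, since globstar is a bash option, and it runs the glob in an explicit bash so the snippet also works when pasted into another shell.

```shell
# Scratch tree with files at different depths.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b/c" "$tmp/x"
printf '%s\n' 'abc cba' \
  'little boy writes -54321_12345 and goes to street 987' \
  > "$tmp/a/b/c/file_001.txt"
printf '%s\n' 'little boy writes -12345_54321 and jumps to street 789' \
  > "$tmp/x/file_002.txt"
# This file has a matching line, but its name does not match the glob.
printf '%s\n' 'little boy writes -1_2 and skips to street 3' \
  > "$tmp/x/notes.txt"

# Run the glob in bash with globstar enabled; ** matches any directory depth.
result=$(cd "$tmp" && bash -c '
  shopt -s globstar
  sed -En "/^little boy writes/{ s/^[^0-9_]*([0-9_]+)[^0-9_]*([0-9_]+\$)/\1 \2/g; p; }" \
    **/file_[0-9][0-9][0-9].txt
')
printf '%s\n' "$result"
# 54321_12345 987
# 12345_54321 789
```

notes.txt is never opened, because the shell only expands the glob to names of the form file_NNN.txt before sed starts.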

Good luck.
