
I am trying to count the number of matching lines in a very LARGE file and store the counts in variables, using only Bash shell commands.

Currently, I am scanning the very large file twice, using a separate grep statement each time, like so:

$ cat test.txt 
first example line one
first example line two
first example line three
second example line one
second example line two
$ FIRST=$( cat test.txt | grep 'first example'  | wc --lines ; ) ;  ## first run
$ SECOND=$(cat test.txt | grep 'second example' | wc --lines ; ) ;  ## second run

and I end up with this:

$ echo $FIRST
3
$ echo $SECOND
2

Ideally, I want to scan the large file only once. And I have never used awk and would rather not use it!

The `tee` command is new to me. It seems that passing its output into two separate grep statements might mean we only have to scan the large file once.

Ideally, I would also like to be able to do this without having to create any temporary files & subsequently having to remember to delete them.

I have tried multiple ways using something like these below:

FIRST=''; SECOND='';
cat  test.txt                                                   \
    |tee  >(FIRST=$( grep 'first example'  | wc --lines ;);)    \
          >(SECOND=$(grep 'second example' | wc --lines ;);)    \
          >/dev/null        ;

and using read:

FIRST=''; SECOND='';
cat  test.txt                                                       \
   |tee  >(grep 'first example'   | wc --lines | (read FIRST);  );  \
         >(grep 'second example'  | wc --lines | (read SECOND); );  \
         > /dev/null                   ;



cat  test.txt                                                           \
      | tee  <( read FIRST  < <(grep 'first example'  | wc --lines ))   \
             <( read SECOND < <(grep 'second example' | wc --lines ))   \
             >    /dev/null             ;

and with curly brackets:

FIRST=''; SECOND='';
cat test.txt                                                     \
  |tee   >(FIRST={$( grep 'first example'  | wc --lines ;)} )    \
         >(SECOND={$(grep 'second example' | wc --lines ;)} )    \
         >/dev/null                           ;
   

but none of these allow me to save the line count into variables FIRST and SECOND.

Is this even possible to do?

John Kugelman
edwardsmarkf
  • "$( cat test.txt | grep 'first example' | wc --lines ; ) " you don't need cat, just "grep 'first example' test.txt" and you will avoid extra process – Saboteur Sep 26 '21 at 23:00
  • You will find an answer [here](https://stackoverflow.com/questions/28145077/bash-to-find-count-of-multiple-strings-in-a-large-file). – tshiono Sep 26 '21 at 23:05
  • why the aversion to using `awk`? with one relatively simple `awk` script you can replace all of your current code **and** only scan the file once – markp-fuso Sep 26 '21 at 23:28
  • The `>(command)` process substitution runs the command in a subshell. A variable you set in a subshell will not persist into the parent shell that's running the script. Sorry, this approach cannot work as you hope. – glenn jackman Sep 27 '21 at 03:12
  • @edwardsmarkf : `tee` is not an _option_, it is an executable command - see _man tee_. – user1934428 Sep 27 '21 at 07:46
  • @edwardsmarkf : I don't quite understand what you want to achieve: Which information exactly would you like to have saved for later use? The output of the `grep` **and** the number of lines in that output? – user1934428 Sep 27 '21 at 07:48

4 Answers


tee isn't saving any work here. Each grep still does a full scan of the file, so either way you've got three passes through the file: two greps and one Useless Use of Cat. In fact, tee just adds a fourth process that loops over the whole file.

The various | tee invocations you tried don't work because of one fatal flaw: variable assignments don't work in pipelines. That is to say, they "work" insofar as a variable is assigned a value, it's just the value is almost immediately lost. Why? Because the variable is in a subshell, not the parent shell.

Every command in a | pipeline executes in a different process and it's a fundamental fact of Linux systems that processes are isolated from each other and don't share variable assignments.

As a rule of thumb, you can write variable=$(foo | bar | baz) where the variable is on the outside. No problem. But don't try foo | variable=$(bar) | baz where it's on the inside. It won't work and you'll be sad.
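Here is a minimal demonstration of that rule. The assignment inside the pipeline happens in a subshell and is thrown away; the assignment outside the pipeline is performed by the parent shell and sticks:

```shell
# Assignment inside a pipeline stage: it happens in a subshell,
# so the parent shell never sees the new value.
count=0
printf 'a\nb\nc\n' | { count=$(wc -l); }   # inner count dies with the subshell
echo "$count"                              # prints 0, not 3

# Assignment outside the pipeline: the parent shell performs it, so it sticks.
count=$(printf 'a\nb\nc\n' | wc -l)
echo "$count"                              # the line count, 3
```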

But don't lose hope! There are plenty of ways to skin this cat. Let's go through a few of them.

Two greps

Getting rid of cat yields:

first=$(grep 'first example' test.txt | wc -l)
second=$(grep 'second example' test.txt | wc -l)

This is actually pretty good and will usually be fast enough. Linux maintains a large page cache in RAM. Any time you read a file Linux stores the contents in memory. Reading a file multiple times will usually hit the cache and not the disk, which is super fast. Even multi-GB files will comfortably fit into modern computers' RAM, particularly if you're doing the reads back-to-back while the cached pages are still fresh.

One grep

You could improve on this with a single grep call that searches for both strings. That works if you don't actually need the individual counts but just want the total:

total=$(grep -e 'first example' -e 'second example' test.txt | wc -l)

Or if there are very few lines that match, you could use it to filter down the large file into a small set of matching lines, and then use the original greps to pull out the separate counts:

matches=$(grep -e 'first example' -e 'second example' test.txt)
first=$(grep 'first example' <<< "$matches" | wc -l)
second=$(grep 'second example' <<< "$matches" | wc -l)

Pure bash

You could also build a Bash-only solution that does a single pass and invokes no external programs. Forking processes is slow, so using only built-in commands like read and [[ can offer a nice speedup.

First, let's start with a while read loop to process the file line by line:

while IFS= read -r line; do
   ...
done < test.txt

You can count matches by using double square brackets [[ and string equality ==, which accepts * wildcards:

first=0
second=0

while IFS= read -r line; do
    [[ $line == *'first example'* ]] && ((++first))
    [[ $line == *'second example'* ]] && ((++second))
done < test.txt

echo "$first"   ## should display 3
echo "$second"  ## should display 2

Another language

If none of these are fast enough then you should consider using a "real" programming language like Python, Perl, or, really, whatever you are comfortable with. Bash is not a speed demon. I love it, and it's really underappreciated, but even I'll admit that high-performance data munging is not its wheelhouse.

John Kugelman
  • very interesting. should the wildcards look like this? `'*first example*'` with asterisk inside the quotes? – edwardsmarkf Sep 27 '21 at 02:25
  • 1
    I know, it looks funny. If they're quoted then they are literal asterisks. They need to be unquoted to be treated as wildcards. – John Kugelman Sep 27 '21 at 02:30
  • "I know, it looks funny. " - i can say that about the entire OS. "Two of the most famous products of Berkeley are LSD and Unix. I don’t think that is a coincidence.” – edwardsmarkf Sep 27 '21 at 03:02
  • one last dumb question - what happens if instead of reading directly from `test.txt` i actually wanted to use the results from another command such as `find` ? i tried changing: ```done < $(cat test.txt;)``` but got an "ambiguous redirect" error. – edwardsmarkf Sep 27 '21 at 05:21
  • 1
    Use [process substitution](https://www.gnu.org/software/bash/manual/html_node/Process-Substitution.html): `done < <(cat test.txt)` – John Kugelman Sep 27 '21 at 12:00
  • ha! i had just researched this & found the same solution you graciously provided me with! https://stackoverflow.com/questions/16854280/a-variable-modified-inside-a-while-loop-is-not-remembered THANK YOU SO MUCH John K. i see why you have earned every one of your 65 well deserved gold badges. – edwardsmarkf Sep 27 '21 at 16:08
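Following up on the last comments: the same counting loop can read from any command's output via process substitution. A minimal self-contained sketch, where `printf` stands in for the real command (e.g. `find`):

```shell
first=0
second=0

# Feed the loop from a command instead of a file, using < <(...)
# (printf here is a stand-in for whatever command produces the lines).
while IFS= read -r line; do
    [[ $line == *'first example'* ]]  && ((++first))
    [[ $line == *'second example'* ]] && ((++second))
done < <(printf '%s\n' 'first example line one'   \
                       'first example line two'   \
                       'first example line three' \
                       'second example line one'  \
                       'second example line two')

echo "$first $second"   # prints: 3 2
```

Because `< <(...)` redirects into the current shell's loop rather than piping into a subshell, `first` and `second` keep their values after the loop ends.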

If you're going to be doing things like this, I'd really recommend getting familiar with awk; it's not scary, and IMO it's much easier to do complex things like this with it vs. the weird pipefitting you're looking at. Here's a simple awk program that'll count occurrences of both patterns at once:

awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt

Explanation: /first example/ {first++} means: for each line that matches the regex pattern "first example", increment the first variable. /second example/ {second++} does the same for the second pattern. Then END {print first, second} prints the two variables at the end. Simple.

But there is one tricky thing: splitting the two numbers it prints into two different variables. You could do this with read:

bothcounts=$(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
read first second <<<"$bothcounts"

(Note: I recommend using lower- or mixed-case variable names, to avoid conflicts with the many all-caps names that have special functions.)

Another option is to skip the bothcounts variable by using process substitution to feed the output from awk directly into read:

read first second < <(awk '/first example/ {first++}; /second example/ {second++}; END {print first, second}' test.txt)
John Kugelman
Gordon Davisson
  • This is the solution I'd use. It's funny, I was going to mention it in my answer initially but thought, "Nah, I don't want to go into how to extract two values from one command." Then my answer ballooned into a Charles Dickens novel and I forgot to circle back around and put it in. – John Kugelman Sep 27 '21 at 01:16
  • @JohnKugelman I wondered why you hadn't covered it. BTW, thanks for the edit; I carefully found & fixed that problem, and then carefully copied & pasted the wrong version. D'oh! – Gordon Davisson Sep 27 '21 at 02:58
  • i don't do stuff like this often enough to learn awk i am afraid. and i used to use perl until the world seemed to move away from it, and i got tired of waiting for perl-6 (like waiting for republican presidential candidate tax returns) – edwardsmarkf Sep 27 '21 at 05:10

`>` redirects to a file or device, not to the next command in a pipe. So tee just lets you redirect the pipe to multiple files, not to multiple commands. Try this instead:

FIRST=$(grep 'first example' test.txt| wc --lines)
SECOND=$(grep 'second example' test.txt| wc --lines)
Saboteur
  • 2
    `>(command)` isn't actually redirection, it's [process substitution](https://tldp.org/LDP/abs/html/process-sub.html). – John Kugelman Sep 26 '21 at 23:34
  • ">" is redirection, and $(command) is process substitution I think, kindly correct me – Saboteur Sep 27 '21 at 00:09
  • `$(command)` is [command substitution](https://www.gnu.org/software/bash/manual/html_node/Command-Substitution.html). `<(command)` and `>(command)` are [process substitution](https://www.gnu.org/software/bash/manual/html_node/Process-Substitution.html). (One or both of these is poorly named, I admit.) – John Kugelman Sep 27 '21 at 00:17

It's possible to get the matches and count them in a single pass, then extract the count of each from the result.

matches="$(grep -e 'first example' -e 'second example' --only-matching test.txt | sort | uniq -c | tr -s ' ')"

FIRST=$(grep -e 'first example' <<<"$matches" | cut -d ' ' -f 2)
echo $FIRST

Result:

3
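SECOND comes out of the same $matches the same way. A self-contained sketch of the whole pipeline, with the question's sample data inlined in place of test.txt:

```shell
# Build the per-pattern counts in one pass: extract only the matched text,
# then count duplicates with sort | uniq -c (tr -s squeezes the padding).
matches="$(printf '%s\n' 'first example line one'   \
                         'first example line two'   \
                         'first example line three' \
                         'second example line one'  \
                         'second example line two'  \
  | grep -e 'first example' -e 'second example' --only-matching \
  | sort | uniq -c | tr -s ' ')"

# Each line of $matches now looks like " 3 first example";
# field 2 (space-delimited) is the count.
FIRST=$(grep 'first example' <<<"$matches" | cut -d ' ' -f 2)
SECOND=$(grep 'second example' <<<"$matches" | cut -d ' ' -f 2)
echo "$FIRST $SECOND"   # prints: 3 2
```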

Using awk is the best option I think.

LMC