0

This bash script has a simple premise for now, recursively look through a directory, take every file from that directory, and from those files look for patterns that match email addresses, take those email addresses, sort them and then count the amount of times they appear, and sort them again.

Take those sorted emails and then with the head script, print the top emails according to the variable PUNISHED.

The directory is via the second argument, and if no directory is chosen then it simply runs through the current directory.

like this.

./myscript 20 /usr/home/AlabasterTenRing

Here is the code.

#!/bin/bash

shopt -s globstar

PUNISHED=$1
VENOM= $2

echo >> topemails.txt

for files in ./${VENOM}/**/*; do
    if [ -f "${files}" ] ; then
        <"$files" tr '[[:upper:]]' '[[:lower:]]' \
            | grep -i -o '[A-Za-z0-9._%+-]\ + @[A-Za-z0-9.]\ + .[A-Za-z]\{2,4\}' \
            | xargs -n 1 \
            | sort \
            | uniq -c \
            | sort -nr > topemails.txt
    fi
done
echo "The top \"${PUNISHED}\" emails are"
head -$PUNISHED topemails.txt

What ends up happening instead is that topemails.txt prints just as intended, but the number '1' is all that appears in it.

What could I do differently?

Benjamin W.
  • 46,058
  • 19
  • 106
  • 116
MATT_BOOTY
  • 83
  • 8
  • 1
    You overwrite the file on each iteration. You can use `>>` in the loop to append (or even better, simply redirect the entire loop) – that other guy Oct 03 '19 at 17:22
  • Side Comments: check if 'xargs' is needed. The 'grep' command will print one token per line, so in theory, no need for xargs. Also, 'grep' is using case insensitive match (-i), therefore no need to use both A-Z and a-z in the patterns. – dash-o Oct 03 '19 at 17:25
  • Also 'VENOM= $2' will not work if '$2' is non-empty. Space not allowed between '=' and value. – dash-o Oct 03 '19 at 17:25
  • If you code still does not work, consider sharing sample input (few lines) data from one of your files – dash-o Oct 03 '19 at 17:28
  • @thatotherguy when you say use >> in the loop to append, can you elaborate? Also, what do you mean redirect the entire loop? – MATT_BOOTY Oct 03 '19 at 17:42

2 Answers2

1

Something I think is equivalent to your script, but it depends on which grep favor is used. You can use grep to scan recursively and skip the loop, since you seem to scan all files anyway.

#!/bin/bash

PUNISHED=$1
VENOM=$2

echo "The top ${PUNISHED} emails are"
grep -Eroh "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}" "./${VENOM}" \
    | tr '[:upper:]' '[:lower:]' \
    | sort \
    | uniq -c \
    | sort -nr \
    | head -n $PUNISHED

Domain names can have longer top levels than 4 characters now, but I left the 4 inthe regex.

Snorre Løvås
  • 251
  • 1
  • 5
1

Here's an easier way to reproduce your problem:

for i in 1 2 3
do
  echo "$i" > file
done

You're expecting file to contain:

1
2
3

Instead it contains:

3

This is because > truncates and overwrites the file each time, so you'll only get results from the last iteration.

You can either have each iteration append rather than overwrite:

for i in 1 2 3
do
  echo "$i" >> file
done

Or just redirect the whole loop, so that everything it outputs will be collected in the file:

for i in 1 2 3
do
  echo "$i"
done > file
that other guy
  • 116,971
  • 11
  • 170
  • 194
  • I used your fixes, this is progress but now only the directory name appears in topemails.txt, none of the files that should be recursively run through even so much affect the script. – MATT_BOOTY Oct 03 '19 at 18:32
  • @MATT_BOOTY It does if you put e.g. `a + @b + cde` in a file. `foo@example.com` won't affect the result since the regex doesn't match it – that other guy Oct 03 '19 at 18:38
  • Wait, my regex doesn't match foo@example.com? But that's exactly what I want to be matching. – MATT_BOOTY Oct 03 '19 at 18:45
  • See [this question](https://stackoverflow.com/questions/2898463/using-grep-to-find-all-emails) for various suggestions on how to match emails with `grep`. You'll find scripting much easier if you write and test small parts of individual commands every step of the way, rather than writing a whole script and then try to debug all issues simultaneously. `grep -i -o '[A-Za-z0-9._%+-]\ +'` should be matching `hello` but doesn't, so that narrows the problem down considerably (in this case due to the space) – that other guy Oct 03 '19 at 19:02
  • I don't think it's the grep anymore, I changed it to one that had worked, even removed tr, it didn't work at all. It only lists the directory that I inserted as an argument as opposed to recursively looking through the directory for files. – MATT_BOOTY Oct 03 '19 at 20:21
  • @MATT_BOOTY When you say "It only lists the directory that I inserted as an argument", do mean "I'm getting the error `./myscript: line 6: /usr/home/AlabasterTenRing: Is a directory`"? If so, that's because you have a space after the `=` on line 6. If you fix that and get zero matches, it's because you're looking at `.//usr/home/AlabasterTenRing/**/*` instead of `/usr/home/AlabasterTenRing/**/*` – that other guy Oct 03 '19 at 20:27