2

I am trying to use bash to parse an HTML file using grep.

The HTML won't change, so I should be able to find the text easily enough.

The HTML will be like this, and I just want the number which will change each time the file changes:

<div class="total">
          900 files inspected,
          28301 offenses detected:
        </div>


grep -E '^<div class="total">.</div>' my_file.html

Ideally I just want to pull the number of offenses, so in the example above it would be 28301. I would also like to assign it to a variable.

Am I close?

user3437721

2 Answers

1

You can do this with a simple one-liner:

a=$(grep -oP '(\d+)(?=\soffenses\sdetected)' abc); echo "$a"

This will give:

28301

-o prints only the matching part of the line

-P enables Perl-compatible regular expressions (PCRE)

abc is the name of the file

(\d+)(?=\soffenses\sdetected): in this regex we use a positive lookahead to capture only the digits that are followed by " offenses detected", without including those words in the match
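To try this without the real report, here is a minimal sketch that recreates the snippet from the question in a temporary file and stores the match in a variable. The temporary file and the variable names `sample` and `offenses` are assumptions for illustration, and the command still requires GNU grep built with PCRE support.

```shell
#!/usr/bin/env bash
# Recreate the sample HTML from the question in a temporary file
# (hypothetical stand-in for the real report).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<div class="total">
          900 files inspected,
          28301 offenses detected:
        </div>
EOF

# -o prints only the matched text; -P enables Perl-compatible regex.
# The lookahead (?=...) requires " offenses detected" to follow the digits
# but keeps those words out of the output.
offenses=$(grep -oP '\d+(?=\soffenses\sdetected)' "$sample")

echo "$offenses"
rm -f "$sample"
```

Quoting `"$offenses"` when using it avoids word splitting if the pattern ever matches more than one line.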

Inder
  • 1
    Thanks, makes sense I appreciate the explanation – user3437721 Sep 02 '18 at 22:26
  • 2
    Note that this depends on GNU `grep`, and specifically, a GNU `grep` compiled with the optional library libpcre. – Charles Duffy Sep 02 '18 at 22:53
  • 2
    To build on the GNU grep note from Charles, I'll point out that `grep` of any type is not actually *part of bash*, and varies from system to system. BSD systems may have an optional `pcregrep` binary installed with a `pcre` package that behaves similarly to `grep -P`, and macOS can get the same tool using MacPorts or Homebrew. – ghoti Sep 02 '18 at 22:58
  • @CharlesDuffy absolutely valid point – Inder Sep 02 '18 at 23:12
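Given the portability concerns raised in these comments, one possible fallback that does not need `grep -P` is plain `awk`. This is only a sketch: the function name `count_offenses` is hypothetical, and it assumes the number is a whitespace-separated field immediately before the word "offenses", as in the sample from the question.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the numeric field that directly precedes
# the word "offenses" on any line of the given file.
count_offenses() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($(i + 1) == "offenses" && $i ~ /^[0-9]+$/)
        print $i
  }' "$1"
}

# Sample input copied from the question, written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
  '<div class="total">' \
  '          900 files inspected,' \
  '          28301 offenses detected:' \
  '        </div>' > "$sample"

offenses=$(count_offenses "$sample")
echo "$offenses"
rm -f "$sample"
```

Because it compares whole fields rather than using a lookahead, this version should behave the same on GNU, BSD, and macOS `awk`.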
0

If you have GNU grep and GNU sed, you can do:

$ cat file | xargs | grep -Po '<div class=total>\K(.*?)</div>' | sed -E 's/<\/div>//; s/, /\n/'
 900 files inspected
28301 offenses detected: 

If you have ruby available:

$ ruby -e 'puts readlines.join[/(?<=<div class="total">).+(?=<\/div>)/m].gsub(/^[ \t]+/m,"")' file 
900 files inspected,
28301 offenses detected:
dawg