Grep string between two html comments in pages

Question

I have to do a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is, the header and footer contain that class, so a grep returns every single page, which is not useful.

So, how do I grep for content?

I have a <!-- main content --> and a <!-- end content --> comment on every page.

So how do I grep (do I even grep?) for what is between those comments?

This is hosted on a Linux server, and I have access to grep, awk and sed.

Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.

Thanks!

Use the appropriate part of [this answer](http://stackoverflow.com/a/17914105/258523) to filter to just that section of the page then write your other matching to occur when in that section. — Etan Reisner, Jun 18 '15 at 23:07
[You can't parse HTML with regex.](http://stackoverflow.com/a/1732454/282912) — msw, Jun 19 '15 at 08:01

henfiber · Answer 1 · 2015-06-20T21:25:59.313

The following script performs exactly what you requested: Print the files and the line numbers where the CSS class name occurs:

#!/bin/sh
pattern="class=\"([A-Za-z0-9_-]* )*$1( [A-Za-z0-9_-]*)*\""

awk -v pat="$pattern" '
   /<!-- main content -->/ {Y=1}
   /<!-- end content -->/ {Y=0}
   Y && $0 ~ pat {f[FILENAME] = f[FILENAME]" "FNR;}
   END {for (k in f) printf "%s\tlines:%s\n", k,f[k];}
' *.html

Save it as class_find.sh and use it like this:

class_find.sh 'my_class'

where my_class is the class name you want to search for.

Output:

2.html  lines: 7 9
1.html  lines: 5

Some explanation:

pattern="class=\"([A-Za-z0-9_-]* )*$1( [A-Za-z0-9_-]*)*\"" : search for class="my_class" or class="others my_class" or class="my_class others"
/<!-- main content -->/ {Y=1} : when this string is found, set flag Y to true; /<!-- end content -->/ {Y=0} : when this string is found, set flag Y to false
Y && $0 ~ pat {f[FILENAME] = f[FILENAME]" "FNR;} : if the flag Y is true and a match for the class is found in the current line ($0), save the line number to the associative array f with key the filename.
END {for (k in f) printf "%s\tlines:%s\n", k,f[k];} : after reading all files print the results in a nice format
*.html : Operate on the html files found in the current directory
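
If a CSV report is preferable (the question mentions .txt or .csv), the same flag technique can print one row per match instead of collecting line numbers per file. This is a minimal sketch, not part of the original answer: class_report is a hypothetical function name, the FNR == 1 reset is an added assumption so that a page missing its end comment does not carry the flag into the next file, and the sample page is made up for the demo.

```shell
#!/bin/sh
# Sketch: print "file,line" rows for a class found between the content markers.
# class_report is a hypothetical helper name, not from the original answer.
class_report() {
    class=$1; shift
    pattern="class=\"([A-Za-z0-9_-]* )*${class}( [A-Za-z0-9_-]*)*\""
    awk -v pat="$pattern" '
        FNR == 1                { inside = 0 }  # assumption: reset the flag for each new file
        /<!-- main content -->/ { inside = 1 }
        /<!-- end content -->/  { inside = 0 }
        inside && $0 ~ pat      { printf "%s,%d\n", FILENAME, FNR }
    ' "$@"
}

# Demo on a throwaway page (made-up content)
tmp=$(mktemp -d)
cat > "$tmp/1.html" <<'EOF'
<div class="my_class">header</div>
<!-- main content -->
<p class="intro my_class">body</p>
<p class="my_class_small">not a match</p>
<!-- end content -->
<div class="my_class">footer</div>
EOF
class_report my_class "$tmp"/*.html
rm -rf "$tmp"
```

On the sample page this should print a single row for line 3, since the header, the footer and my_class_small are all skipped. Redirect the output (for example > report.csv) to keep the report.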

Nice solution, but the problem with `pattern="class=[\"A-Za-z0-9 _-]*$1[\"A-Za-z0-9 _-]*"` is when you `class_find.sh 'my_class'`, it will also match `class="my_class_small"` and other variations. — Nathan Wilson, Jun 20 '15 at 15:00

Nathan Wilson · Answer 2 · Jun 18 '15 at 23:14

You can use sed to search for CLASS between the comment tags. Then pipe the output to wc -l if you just want to count the matching lines.

sed -n '/<!-- main content -->/,/<!-- end content -->/ s/CLASS/CLASS/p' filename | wc -l
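
To run the same range filter over many pages and list only those that contain the class in their content, a loop along these lines might do. It is a sketch under assumptions: count_in_content is a hypothetical function name, the class is matched as a plain grep pattern (so my_class would also match my_class_small), and the two sample pages are made up.

```shell
#!/bin/sh
# Sketch: per-file count of matching lines between the content markers.
# count_in_content is a hypothetical helper name.
count_in_content() {
    class=$1; shift
    for f in "$@"; do
        # keep only the content section, then count the lines that mention the class
        n=$(sed -n '/<!-- main content -->/,/<!-- end content -->/p' "$f" | grep -c -- "$class" || true)
        # report only pages where the class shows up in the content
        if [ "$n" -gt 0 ]; then
            printf '%s\t%s\n' "$f" "$n"
        fi
    done
}

# Demo on two throwaway pages (made-up content)
tmp=$(mktemp -d)
printf '%s\n' 'CLASS header' '<!-- main content -->' 'CLASS body' '<!-- end content -->' 'CLASS footer' > "$tmp/1.html"
printf '%s\n' 'CLASS header' '<!-- main content -->' 'plain body' '<!-- end content -->' 'CLASS footer' > "$tmp/2.html"
count_in_content CLASS "$tmp"/*.html
rm -rf "$tmp"
```

Only 1.html should be listed in the demo, because 2.html has the class in its header and footer but not between the comments.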

NeronLeVelu · Answer 3 · 2015-06-19T07:08:58.747

sed '# remove ":" used for counting later, use another char if ":" is part of the class
   s/://g
# load whole file
   H;$!d
   x
# remove header/trailer
   s/.*<!-- main content -->\(.*\)<!-- end content -->.*/\1/
# keep occurrences of my class only
   s/MyClass/:/g;s/[^:]//g
# count (max 999)
   s/^$/0/;t
   s/:\{100\}/C/g;s/:\{10\}/D/g
   s/^[^C]/0&/;s/\([0C]\)\(:*\)$/\10\2/;s/[^:]$/&0/
   s/\([^0]\)\1\{8\}/9u/g
   s/\([^0]\)\1\{7\}/8u/g
   s/\([^0]\)\1\{6\}/7u/g
   s/\([^0]\)\1\{5\}/6u/g
   s/\([^0]\)\1\{4\}/5u/g
   s/\([^0]\)\1\{3\}/4u/g
   s/\([^0]\)\1\{2\}/3u/g
   s/\([^0]\)\1/2u/g
   s/[CD:]/1/g
   s/u//g
   ' YourFile

All in one sed script. sed is not a champion at counting, so you can pipe the output to wc -c instead of using the counting part (be careful with the 0-occurrence case if you do).

Note that the class can appear anywhere on a line, even several times, and HTML files are not written one element per line, which grep assumes when counting: it counts matching lines, not occurrences on a line.
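
For comparison, occurrences (rather than lines) can also be counted with awk, which swaps the sed counting part for gsub's return value. This is a sketch and not the method above: count_occurrences is a hypothetical function name, the search string is treated as a regular expression, and the markers are assumed to sit on their own lines.

```shell
#!/bin/sh
# Sketch: count occurrences (not lines) of a string between the content markers.
# count_occurrences is a hypothetical helper name; $1 is used as a regex, $2 is the file.
count_occurrences() {
    awk -v pat="$1" '
        /<!-- main content -->/ { inside = 1 }
        /<!-- end content -->/  { inside = 0 }
        inside                  { total += gsub(pat, "&") }  # gsub returns the number of matches on the line
        END                     { print total + 0 }          # prints 0 when nothing matched
    ' "$2"
}

# Demo on a throwaway page (made-up content)
tmp=$(mktemp -d)
printf '%s\n' 'MyClass' '<!-- main content -->' 'MyClass MyClass' 'a MyClass' '<!-- end content -->' 'MyClass' > "$tmp/page.html"
count_occurrences MyClass "$tmp/page.html"
rm -rf "$tmp"
```

The demo page has two matching lines inside the markers but three occurrences, which is the difference described above.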
