I have to write a report on how many times a certain CSS class appears in the content of our pages (over 10k pages). The trouble is that the header and footer contain that class, so a grep returns every single page, which is not useful.

So, how do I grep for content?

I have a <!-- main content --> and a <!-- end content --> comment on every page.

So how do I grep (do I even grep?) for what is between those comments?

This is hosted on a Linux server, and I have access to grep, awk and sed.

Ideally, I would get a report (.txt or .csv) with pages and line numbers where the class shows up, but just a list of the pages themselves would suffice.

Thanks!

JonYork
  • Use the appropriate part of [this answer](http://stackoverflow.com/a/17914105/258523) to filter to just that section of the page then write your other matching to occur when in that section. – Etan Reisner Jun 18 '15 at 23:07
  • [You can't parse HTML with regex.](http://stackoverflow.com/a/1732454/282912) – msw Jun 19 '15 at 08:01

3 Answers

The following script does what you asked: it prints the files and the line numbers where the CSS class name occurs between the two content comments:

#!/bin/sh
pattern="class=\"([A-Za-z0-9_-]* )*$1( [A-Za-z0-9_-]*)*\""

awk -v pat="$pattern" '
   /<!-- main content -->/ {Y=1}
   /<!-- end content -->/ {Y=0}
   Y && $0 ~ pat {f[FILENAME] = f[FILENAME]" "FNR;}
   END {for (k in f) printf "%s\tlines:%s\n", k,f[k];}
' *.html

Save it as class_find.sh and use it like this:

class_find.sh 'my_class'

where my_class is the class name you want to search for.

Output:

2.html  lines: 7 9
1.html  lines: 5

Some explanation:

  • pattern="class=\"([A-Za-z0-9_-]* )*$1( [A-Za-z0-9_-]*)*\"" : search for class="my_class" or class="others my_class" or class="my_class others"
  • /<!-- main content -->/ {Y=1} : when this string is found, set flag Y to true, /<!-- end content -->/ {Y=0} : set flag Y to false
  • Y && $0 ~ pat {f[FILENAME] = f[FILENAME]" "FNR;} : if the flag Y is true and a match for the class is found in the current line ($0), save the line number to the associative array f with key the filename.
  • END {for (k in f) printf "%s\tlines:%s\n", k,f[k];} : after reading all files print the results in a nice format
  • *.html : Operate on the html files found in the current directory
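
Since the question also mentions a .csv report, here is a minimal sketch of the same flag technique that writes one `file,line,count` row per matching line. The function name `class_report` and the column layout are my own assumptions, not part of the script above. It also assumes double-quoted `class="..."` attributes that do not span lines, and it would treat an attribute such as `data-class="..."` the same way.

```shell
#!/bin/sh
# class_report CLASS FILE...   (hypothetical helper, for illustration only)
# Prints "file,line,count" for every line of the content section on which
# CLASS appears as a whole name inside a class="..." attribute.
class_report() {
    cls=$1
    shift
    awk -v cls="$cls" '
        /<!-- main content -->/ { inside = 1 }
        /<!-- end content -->/  { inside = 0 }
        inside {
            n = 0
            rest = $0
            # Walk over every class="..." attribute on the line
            while (match(rest, /class="[^"]*"/)) {
                # Drop the leading class=" (7 chars) and the closing quote
                attr = substr(rest, RSTART + 7, RLENGTH - 8)
                cnt = split(attr, names, /[ \t]+/)
                # Compare whole names, so my_class_small does not count
                for (i = 1; i <= cnt; i++)
                    if (names[i] == cls) n++
                rest = substr(rest, RSTART + RLENGTH)
            }
            if (n > 0) printf "%s,%d,%d\n", FILENAME, FNR, n
        }
    ' "$@"
}
```

Called as `class_report my_class *.html > report.csv`, it lists rows in the order the files are given. Because the names are compared whole, `my_class_small` is not reported when searching for `my_class`.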
henfiber
  • Nice solution, but the problem with `pattern="class=[\"A-Za-z0-9 _-]*$1[\"A-Za-z0-9 _-]*"` is when you `class_find.sh 'my_class'`, it will also match `class="my_class_small"` and other variations. – Nathan Wilson Jun 20 '15 at 15:00

You can use sed to search for CLASS between the comment tags. Then pipe the output to wc -l if you just want to count the matching lines.

sed -n '/<!-- main content -->/,/<!-- end content -->/ s/CLASS/CLASS/p' filename | wc -l
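
To get a list of pages rather than a single number, the same range can be applied file by file. This is only a sketch: the function name `list_pages` and the `file,count` output format are assumptions for illustration, and the count is the number of matching lines in the content section, not the number of occurrences.

```shell
#!/bin/sh
# list_pages CLASS FILE...   (hypothetical helper, for illustration only)
# Prints "file,matching_lines" for each file whose content section mentions CLASS.
list_pages() {
    cls=$1
    shift
    for f in "$@"; do
        # Print only the content section, then count the lines containing CLASS.
        # grep -c exits non-zero when the count is 0, hence the "|| true".
        n=$(sed -n '/<!-- main content -->/,/<!-- end content -->/p' "$f" |
            grep -c -e "$cls" || true)
        if [ "$n" -gt 0 ]; then
            printf '%s,%s\n' "$f" "$n"
        fi
    done
}
```

For example, `list_pages CLASS *.html > pages.csv`. Like the one-liner, this matches CLASS as a plain pattern, so it also hits longer names that contain it.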
Nathan Wilson
sed '# remove ":" used for counting later, use another char if ":" is part of the class
   s/://g
# load whole file
   H;$!d
   x
# remove header/trailer
   s/.*<!-- main content -->\(.*\)<!-- end content -->.*/\1/
# keep occurrences of my class only
   s/MyClass/:/g;s/[^:]//g
# count (max 999)
   s/^$/0/;t
   s/:\{100\}/C/g;s/:\{10\}/D/g
   s/^[^C]/0&/;s/\([0C]\)\(:*\)$/\10\2/;s/[^:]$/&0/
   s/\([^0]\)\1\{8\}/9u/g
   s/\([^0]\)\1\{7\}/8u/g
   s/\([^0]\)\1\{6\}/7u/g
   s/\([^0]\)\1\{5\}/6u/g
   s/\([^0]\)\1\{4\}/5u/g
   s/\([^0]\)\1\{3\}/4u/g
   s/\([^0]\)\1\{2\}/3u/g
   s/\([^0]\)\1/2u/g
   s/[CD:]/1/g
   s/u//g
   ' YourFile

All in one. sed is not a champion at counting, so you can pipe to wc -c instead of using the counting part (be careful with the zero-occurrence case).

  • Keep in mind that the class can appear anywhere on a line, even several times, and HTML files are not written one instruction per line. grep assumes they are when counting: it counts matching lines, not occurrences on a line.
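
If the sed arithmetic is more than you need, a shorter way to count occurrences rather than lines is to let grep -o print each match on its own line. This is a sketch under assumptions: `count_occurrences` is a made-up name, and `-o` is a GNU/BSD grep option, not a POSIX one, so it relies on the grep installed on your server.

```shell
#!/bin/sh
# count_occurrences CLASS FILE   (hypothetical helper, for illustration only)
# Prints how many times CLASS occurs between the two content comments.
count_occurrences() {
    cls=$1
    file=$2
    # Keep only the content section, emit one line per match, count the lines.
    # tr strips the padding that some wc implementations add.
    sed -n '/<!-- main content -->/,/<!-- end content -->/p' "$file" |
        grep -o -e "$cls" | wc -l | tr -d ' '
}
```

For example, `count_occurrences MyClass YourFile` prints 0 when the class is absent from the content section, so the zero case needs no special handling.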
NeronLeVelu