I'm looking for a way to determine the code-to-text ratio of a web page in Perl. I'm not looking for anything complex, just a simple printout like HTML Code:75% Text:25%, for SEO reasons.
-1
-
I would like to take a webpage as a file, place it into a variable, and then determine the percentage that is HTML code and the percentage that is visible text. – Blnukem Mar 07 '12 at 14:41
-
HTML tags are rarely "code". How about using the term "markup"? – mob Mar 07 '12 at 18:13
-
HTML tags are always "code". They aren't "program code" but they are "code". (As is ROT13) – Quentin Mar 07 '12 at 22:36
2 Answers
4
Use HTML::TreeBuilder to strip out the text.
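#!/usr/bin/perl
use strict;
use warnings;
use v5.10;
use LWP::Simple;
use HTML::TreeBuilder;
# Fetch the page at the URL given as the first command-line argument.
my $content = get(shift @ARGV);
die "Couldn't get it!" unless defined $content;
# Parse the HTML and keep only its text content.
my $text = HTML::TreeBuilder->new_from_content($content)->as_text;
my $html_size = length $content;
my $text_size = length $text;
die "Empty document!" unless $html_size; # avoid division by zero
# Text as a percentage of the whole document.
my $percentage = 100 * ( $text_size / $html_size );
say qq[$percentage%];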
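The asker's comment mentions having the page in a variable and wanting both figures printed. A minimal sketch of that variant follows, assuming the HTML is already in a string; `markup_text_ratio` is a hypothetical helper name, not part of the answer or of any module, and the markup share is simply taken as whatever is not text.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the answer): returns (markup %, text %)
# for an HTML document held in a string.
sub markup_text_ratio {
    my ($html) = @_;
    my $total = length $html;
    return (0, 0) unless $total;    # avoid dividing by zero on empty input

    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $text = $tree->as_text;
    $tree->delete;                  # free the parse tree explicitly

    my $text_pct = 100 * length($text) / $total;
    return (100 - $text_pct, $text_pct);
}

# Sample input chosen for illustration only.
my $sample = '<html><body><p>Hello</p></body></html>';
my ($markup_pct, $text_pct) = markup_text_ratio($sample);
printf "HTML Code: %.0f%% Text: %.0f%%\n", $markup_pct, $text_pct;
```

To use it on a saved page, read the file into a scalar first and pass that scalar to the helper. Lengths here are counted on the string as given, so undecoded bytes and decoded characters may give slightly different figures.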

Quentin
-2
Hmm... thinking quickly... How about:
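my $htmllength = 0;
my $textlength = 0;
while(<>) {
# Remove each tag, adding its length to the markup total.
s/(<[^>]*>)/$htmllength += length($1); "";/eg;
# Whatever remains on the line is counted as text.
$textlength += length($_);
}
my $total = $htmllength + $textlength;
die "No input\n" unless $total; # avoid division by zero
print "HTML Code: " . (100 * $htmllength / $total) . "%\n";
print "Text : " . (100 * $textlength / $total) . "%\n";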
You can then simply run the script on the file(s) in question:
perl SCRIPT file1.html file2.html
NOTE: this will not work if your data contains any CDATA fields

Wes Hardaker
-
[You can't parse \[X\]HTML with regex](http://stackoverflow.com/a/1732454/119280)! – DVK Mar 07 '12 at 15:33
-
Nope, but I'm not parsing it either. However, I am letting content through that may otherwise be invisible to the presentation. – Wes Hardaker Mar 07 '12 at 16:34
-
Thanks for all the negative votes! Show me a file that it fails with and I'll delete the post! – Wes Hardaker Mar 08 '12 at 14:46
-
It will also add extra markup-as-content if there is a `>` character in an attribute value. – Quentin Mar 08 '12 at 15:54
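
The case described in that last comment can be reproduced with a small sketch. It applies the same substitution as the answer to one string; `regex_counts` is a hypothetical wrapper added here for illustration, and the sample anchor tag is made up.

```perl
use strict;
use warnings;

# Same tag-stripping substitution as in the answer, wrapped in a
# hypothetical helper so it can be applied to a single string.
sub regex_counts {
    my ($line) = @_;
    my $markup = 0;
    $line =~ s/(<[^>]*>)/$markup += length($1); "";/eg;
    return ($markup, length $line);
}

# The attribute value contains a ">" character, which ends the match early.
my ($markup, $text) = regex_counts('<a title="a > b">link</a>');
print "markup: $markup, text: $text\n";    # prints "markup: 17, text: 8"
```

The visible text is only the four characters of `link`, but the pattern stops at the `>` inside the attribute value, so the leftover ` b">` is counted as text as well.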