I have a split function that splits the strings in a .txt document on spacing and special characters and converts them to lowercase, so that I can count the total number of words in the document. I'm now trying to extend the regular expression so that entire HTML comments, including all the words inside them, are treated as delimiters, but I can't quite get the updated regex to work correctly.
my @words = split /(?:([_\W\s\d]|(<(\w+)>.*<\/\>)))+/, $text;
# Count each word case-insensitively
my %count;
foreach my $word (@words) {
    # Lowercase at tally time so "Word" and "word" share one key
    $count{lc $word}++;
}
# Print each word with its count
foreach my $key (keys %count) {
    print "$key $count{$key}\n";
}
At present, the first character class
[_\W\s\d]+
works fine, but I can't get the second
|(<(\w+).*\/\>)+
to function correctly. When the two are used together, the second one doesn't match as intended and whitespace is counted as a word. Ideally, the text should be split on spacing and special characters and also on whole HTML comments (effectively ignoring any words between the comment tags).
I'm not sure whether I'm able to use two character classes in a split function or not. I'm still getting to grips with regex!
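To make the desired output concrete, here is a minimal sketch of the behaviour described above. It assumes comments use the standard `<!-- ... -->` form, which is not what the pattern in the question matches, and the sample `$text` is hypothetical. It also uses a non-capturing group throughout, because `split` returns the text of any capturing groups as extra fields, which may be how delimiters end up in the word list.

```perl
use strict;
use warnings;

# Hypothetical sample input; the <!-- ... --> comment syntax is an assumption.
my $text = "Hello, World! <!-- ignore these words --> hello again_42 perl";

# Treat a whole comment, or any run of non-letter characters, as one delimiter.
# The group is non-capturing so split does not return the delimiters.
# The /s flag lets . match newlines, so multi-line comments are covered too.
my @words = grep { length } split /(?:<!--.*?-->|[_\W\d])+/s, $text;

# Count case-insensitively by lowercasing each word as it is tallied.
my %count;
$count{lc $_}++ for @words;

# Print the words in sorted order with their counts.
print "$_ $count{$_}\n" for sort keys %count;
```

For the sample input this prints `again 1`, `hello 2`, `perl 1` and `world 1`, one per line. The `grep { length }` drops the empty leading field that `split` produces when the text starts with a delimiter.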