I need to find all the punctuation characters in the text content of a markup-language document.
My sample input:
__DATA__
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="de" ><strong>Kerala unterscheidet</strong> smtp://suriya@edu/tester sich von anderen indischen netftp://suriya@edu Bundesstaaten: Es ist sauberer,
der;Verkehr
nichtso.chaotisch
, und Kirchen säumen die Straßen. Die Region einmalig machen aber die Backwaters <a href="http://www.cochin.org">www.cochin.org</a><link rel="stylesheet" type="text/css" href="../styles/9783734317873.css"/>
I am using [[:punct:]],
but this matches every occurrence in the file, including those inside tags and attribute values.
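my $text = do { local $/; <DATA> };   # slurp the whole DATA section
# print each punctuation character with 5 characters before and 10 after it
while ($text =~ m/(.){5}[[:punct:]](.){10}/g)
{
    print "L: $&\n";
}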
Output:
k rel="styleshee
type="text/css"
href="../styles
g src="../images
17873_140_1.jpg"
alt="image" cla
s nat&x00FC;rlic
xmlns="http://ww
3.org/1999/xhtml
" xml:lang="de"
ioses:Zeugnis na
x00FC;rlicher Pe
ugnis.nat&x00FC;
But I need to skip the punctuation in element attributes and their values. How can I list only the punctuation that occurs in the text content?
Should be skipped: www.w3.org
and "../styles/97
Should be found: der;Verkeh
and so.chaotisch
Question Updated:
Please do not remove any content or HTML elements to find the punctuation, since we need the exact line number and column number of each match. If the HTML elements were removed, the column numbers would change.
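To make the constraint concrete, here is a rough sketch of the kind of thing I have in mind: work on a copy in which every tag is overwritten with spaces of the same length, so the offsets stay the same, and then scan that copy. The sub name find_text_punct and the sample string are made up for illustration, and the <[^>]*> pattern is only an approximation of a tag (it would break on a ">" inside an attribute value, and it does not treat comments or entities specially). It also assumes the text has already been decoded, so that columns count characters rather than bytes.

```perl
use strict;
use warnings;

# Return [line, column, char] for every punctuation character that lies
# outside of <...> tags. Lines and columns are 1-based.
sub find_text_punct {
    my ($text) = @_;

    # Blank out each tag with spaces of the same length, keeping newlines,
    # so every remaining character stays at its original offset.
    # Assumption: <[^>]*> is a good enough approximation of a tag.
    (my $masked = $text) =~ s{(<[^>]*>)}{ (my $t = $1) =~ s/[^\n]/ /g; $t }ge;

    my @hits;
    while ($masked =~ /([[:punct:]])/g) {
        my $offset = $-[1];                        # 0-based offset of the match
        my $before = substr($masked, 0, $offset);
        my $line   = 1 + ($before =~ tr/\n//);     # newlines before the match
        my $col    = $offset - (rindex($before, "\n") + 1) + 1;
        push @hits, [$line, $col, $1];
    }
    return @hits;
}

# Hypothetical sample: the "." inside the attribute value must not be reported.
my $sample = qq{<p class="a.b">der;Verkehr\nnichtso.chaotisch</p>\n};
printf "line %d, col %d: %s\n", @$_ for find_text_punct($sample);
# prints:
# line 1, col 19: ;
# line 2, col 8: .
```

Punctuation in URLs that appear as plain text (such as smtp://suriya@edu/tester) would still be reported by this, because they are not inside a tag.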
Could someone help me with this?