i want to peform simple tokenization to count the number of words in html line by line, except the words between <a>
tag and the words between <a>
tag will count individually
can nltk do this? or there any library can do this?
for example : this the html code
<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>
and i want the output will be
WordsCount : 0 LinkWordsCount : 0
WordsCount : 21 LinkWordsCount : 2
WordsCount : 19 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 2
WordsCount is the number of words in each line except the text between <a>
tag. And if there a word appear twice it will be count as two.
LinkWordsCount is the number of words in between <a>
tag.
so how to make it count line by line except the <a>
tag, and the words between <a>
tag will count individually.
Thank You.