
I want to perform simple tokenization to count the number of words in HTML, line by line. Words inside <a> tags should be left out of that count and counted separately.

Can NLTK do this, or is there another library that can?

For example, this is the HTML code:

<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>

And I want the output to be:

WordsCount : 0 LinkWordsCount : 0
WordsCount : 21 LinkWordsCount : 2
WordsCount : 19 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 2

WordsCount is the number of words in each line, excluding the text inside <a> tags. If a word appears twice, it is counted twice. LinkWordsCount is the number of words inside <a> tags.

So how can I count the words line by line, excluding the <a> tags, and count the words inside <a> tags separately?

Thank you.

Kim Hyesung

I'm having a bit of trouble understanding your question. Can you please show what the current output is and what you want the output to be so we can see how they differ? Thank you. – mmenschig Nov 10 '16 at 17:01

2 Answers


Iterate over each line of raw HTML and simply search for links in each line.

In the example below, I am using a very naive way of getting the word count: split the line on whitespace. This way, a lone "-" is counted as a word and "BATAM.TRIBUNNEWS.COM," counts as a single word. Note that the words inside links are included in WordsCount.

from bs4 import BeautifulSoup

html = """
<div class="side-article txt-article">
<p><strong>BATAM.TRIBUNNEWS.COM, BINTAN</strong> - Tradisi pedang pora mewarnai serah terima jabatan pejabat di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> <a href="http://batam.tribunnews.com/tag/bintan/" title="Bintan">Bintan</a>, Senin (3/10/2016).</p>
<p>Empat perwira baru Senin itu diminta cepat bekerja. Tumpukan pekerjaan rumah sudah menanti di meja masing masing.</p>
<p>Para pejabat tersebut yakni AKP Adi Kuasa Tarigan, Kasat Reskrim baru yang menggantikan AKP Arya Tesa Brahmana. Arya pindah sebagai Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" title="Polres">Polres</a> Tanjungpinang.</p>
"""

for line in html.strip().split('\n'):
    # parse each line on its own so that the counts are per line
    line_soup = BeautifulSoup(line.strip(), 'html.parser')

    # count the words inside <a> tags
    link_words = 0
    for link in line_soup.find_all('a'):
        link_words += len(link.text.split())

    # naive way to get the word count: split on whitespace (link words included)
    words_count = len(line_soup.text.split())
    print('WordsCount : {0} LinkWordsCount : {1}'
          .format(words_count, link_words))

Output:

WordsCount : 0 LinkWordsCount : 0
WordsCount : 16 LinkWordsCount : 2
WordsCount : 17 LinkWordsCount : 0
WordsCount : 25 LinkWordsCount : 1
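
If the link words should be left out of WordsCount, as the question asks, one option is to subtract them from the total. This is only a minimal sketch under the same naive whitespace-splitting assumption; `count_line` is a hypothetical helper name, and the subtraction can be off when link text touches other text without a space in between.

```python
from bs4 import BeautifulSoup


def count_line(line):
    """Return (words outside <a> tags, words inside <a> tags) for one HTML line."""
    line_soup = BeautifulSoup(line.strip(), 'html.parser')
    # words inside every <a> tag on this line
    link_words = sum(len(link.get_text().split())
                     for link in line_soup.find_all('a'))
    # all words on the line, link words included
    total_words = len(line_soup.get_text().split())
    return total_words - link_words, link_words


line = '<p>Kabag Ops di <a href="#" title="Polres">Polres</a> Tanjungpinang.</p>'
words, link_words = count_line(line)
print('WordsCount : {0} LinkWordsCount : {1}'.format(words, link_words))
```

Here "Polres" is counted only as a link word, so the remaining count covers "Kabag", "Ops", "di" and "Tanjungpinang.".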

EDIT

If you want to read the HTML from a file, use something like this:

with open(path_to_html_file, 'r') as f:
    html = f.read()
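
An error such as `'file' object has no attribute 'strip'` usually means the file object itself was used where a string was expected. As a rough sketch (Python 3; `read_html_lines` is a hypothetical helper, and the temporary file only stands in for the real HTML file), the lines can be read as plain strings first and then passed to the loop above:

```python
import os
import tempfile


def read_html_lines(path):
    """Return the non-blank lines of a text file, stripped of surrounding whitespace."""
    with open(path, 'r', encoding='utf-8') as f:
        # iterating over the file object yields one str per line
        return [line.strip() for line in f if line.strip()]


# Demo: a temporary file stands in for the real HTML file.
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('<div class="side-article txt-article">\n'
              '<p>Empat perwira baru</p>\n')
    sample_path = tmp.name

lines = read_html_lines(sample_path)
os.remove(sample_path)
print(lines)
```

Blank lines are skipped in this sketch; drop the `if line.strip()` filter if they should be reported with zero counts instead.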
Dušan Maďar

Wow, thanks! It really works. But if I use HTML from a file, it says AttributeError: 'file' object has no attribute 'strip'. What do I have to do to use an HTML file as the input? – Kim Hyesung Nov 11 '16 at 05:07

I would suggest trying regular expressions, via Python's re module.

To count link words, use a regex that counts href= occurrences, like this one

Regex will also help you find the words that are not inside < >. By splitting them on spaces you get a list, and calling len on it gives you the number of words.

That would be the path I would take.
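
A rough sketch of what this could look like with `re`. The patterns and the `regex_counts` name are my own assumptions, not something taken from this answer, and regular expressions are fragile on real-world HTML (nested or malformed tags, attributes containing `>`), so treat this only as an illustration for simple, well-formed lines:

```python
import re

# text between an opening <a ...> tag and its closing </a>
LINK_RE = re.compile(r'<a\b[^>]*>(.*?)</a>', re.IGNORECASE | re.DOTALL)
# any remaining tag, e.g. <p>, </strong>
TAG_RE = re.compile(r'<[^>]+>')


def regex_counts(line):
    """Return (words outside <a> tags, words inside <a> tags) for one HTML line."""
    link_texts = LINK_RE.findall(line)
    link_words = sum(len(TAG_RE.sub(' ', text).split()) for text in link_texts)
    # drop the links entirely, then strip the other tags and split on whitespace
    without_links = LINK_RE.sub(' ', line)
    words = len(TAG_RE.sub(' ', without_links).split())
    return words, link_words


line = ('<p>Kabag Ops di <a href="http://batam.tribunnews.com/tag/polres/" '
        'title="Polres">Polres</a> Tanjungpinang.</p>')
print(regex_counts(line))
```

Because the links are removed before splitting, punctuation that directly follows a link (such as the comma after "Bintan" in the question's example) ends up counted as a word of its own.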

BigRetroMike

Please do not suggest [parsing HTML with RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)... – lenz Nov 10 '16 at 23:26