I wrote a word-count program in Python.

I want to extract the text from the following page and get the frequency of each word: http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=1&CN=1&CV=99

The problem is that my program prints a separate word count for each verse, but I want a single count for the whole page.

Please help me with that.


import requests
from bs4 import BeautifulSoup
import operator


def start(url):
    word_list = []
    source_code = requests.get(url).text  
    soup = BeautifulSoup(source_code, "html.parser")
    for bible_text in soup.findAll('font', {'class': 'tk4l'}):
        content = bible_text.get_text()   
        words = content.lower().split() 
        for each_word in words:
            word_list.append(each_word)
        clean_up_list(word_list)


def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:                                  
        symbols = "~!@#$%^&*()_+`{}|\"?><`-=\][';/.,']"
        for i in range(0, len(symbols)):
            word = word.replace(symbols[i], "")               
        if len(word) > 0:     
            clean_word_list.append(word)
    create_dictionary(clean_word_list)


def create_dictionary(clean_word_list):
    word_count = {}
    for word in clean_word_list:
        if word in word_count:
            word_count[word] += 1
        else:
            word_count[word] = 1  
    for key, value in sorted(word_count.items(),key=operator.itemgetter(0)):
        print(key, value)                  


start('http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=1&CN=1&CV=99')

Yun Tae Hwang

1 Answer

You are building a fresh word_count dictionary for every verse and then printing the word_count for that verse only. Instead, you need a single instance of word_count.
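In the question's code, `clean_up_list(word_list)` sits inside the `for` loop, so the growing list is cleaned, counted, and printed once per verse. A minimal sketch of the alternative, assuming the verses are already available as plain strings (the `count_words` helper and the sample verses are hypothetical, and punctuation stripping is left out for brevity): collect all the words first, then count once after the loop.

```python
def count_words(verses):
    """Count words across all verses using a single dictionary."""
    word_list = []
    for verse in verses:
        # Only collect words inside the loop; do not count or print here.
        word_list.extend(verse.lower().split())

    # Count once, after every verse has been read.
    word_count = {}
    for word in word_list:
        word_count[word] = word_count.get(word, 0) + 1
    return word_count


# Hypothetical sample data standing in for the scraped verses.
verses = ["my father said", "and then", "father came"]
counts = count_words(verses)
for word, n in sorted(counts.items()):
    print(word, n)
```

Returning the dictionary instead of printing inside the helper also keeps the output in one place.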

Update: There were other problems with the code. You should also use a regular expression to remove all non-alphanumeric characters, and use collections.Counter, which makes the code a lot shorter and, as a nice side effect, lets you retrieve the most common words:

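import re
from collections import Counter

import requests
from bs4 import BeautifulSoup


def parse(url):
    """Return a single Counter of word frequencies for the whole page."""
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")
    count = Counter()  # one counter shared by all verses
    for bible_text in soup.find_all('font', {'class': 'tk4l'}):
        # Drop everything except word characters and spaces; the raw string
        # avoids an invalid-escape warning for \w.
        text = re.sub(r"[^\w ]", "", bible_text.get_text().lower())
        # split() without arguments skips the empty strings that split(" ")
        # produces for consecutive spaces.
        count.update(text.split())
    return count


word_count = parse('http://www.holybible.or.kr/B_NIV/cgi/bibleftxt.php?VR=NIV&VL=1&CN=1&CV=99')
print(word_count.most_common(10))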

Output:

[('the', 83), ('and', 71), ('god', 30), ('was', 29), ('to', 22), ('it', 17), ('of', 16), ('there', 16), ('that', 15), ('in', 15)]
hansaplast
  • Thank you, but why would you avoid a global word count? – Yun Tae Hwang Jan 26 '17 at 06:55
  • It's bad practice in general, see e.g. [here](http://stackoverflow.com/questions/19158339/why-are-global-variables-evil). In your example you might run into problems once you start using threads to speed things up – hansaplast Jan 26 '17 at 07:11
  • Now it seems like the frequency of each word is added on top of the frequency of the same word from earlier verses. For example, suppose the word "father" shows up in verse 2 and another "father" shows up in verse 5. Then it ends up giving me 3 "fathers": verse 2 [father], verse 3 [father], verse 4 [father], verse 5 [father, father], so 5 in total. It is kind of hard to explain, but the numbers are inaccurate... – Yun Tae Hwang Jan 26 '17 at 09:06
  • I don't understand. That's what you said in your question, no? "my program is giving me the word count divided by each verses, but I want it undivided." So you want a general count of e.g. "father" summed up over all the verses, no? – hansaplast Jan 26 '17 at 09:12
  • Hmm, I meant: if "father" appears in verses 2 and 5, that should be father 2, but it gives me father 5. So it is like verse 2 [father], verse 3 [father], verse 4 [father], verse 5 [father, father]... – Yun Tae Hwang Jan 26 '17 at 09:39
  • @YunTaeHwang, oh, in that case there were other problems in the code. I did a complete rewrite. I think it now does what it should – hansaplast Jan 26 '17 at 12:45
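
On the comment above about global state and threads: a minimal sketch, not taken from the answer, of one way to keep counting free of shared globals. Each worker builds its own Counter and the results are merged in the calling thread. The `count_words` and `count_all` helpers and the sample texts are hypothetical, and the texts are assumed to be already downloaded.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(text):
    """Return a Counter for one chunk of text; touches no shared state."""
    return Counter(re.findall(r"\w+", text.lower()))


def count_all(texts, max_workers=4):
    """Count each text in a worker thread, then merge the partial results."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(count_words, texts):
            # Merging happens only here, in the calling thread.
            total.update(partial)
    return total


# Hypothetical sample data standing in for several downloaded chapters.
chapters = ["In the beginning God created", "And God said, Let there be light"]
totals = count_all(chapters)
print(totals["god"])
```

Because no worker writes to a shared dictionary, no locking is needed; with a global word_count, concurrent updates would have to be synchronized.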