How can I create a dictionary for a large amount of text and list the most frequent word?

Question

I am new to coding. I am trying to create a dictionary from a large body of text, and I would also like the most frequent word to be shown.

For example, if I had a block of text such as:

text = '''George Gordon Noel Byron was born, with a clubbed right foot, in London on January 22, 1788. He was the son of Catherine Gordon of Gight, an impoverished Scots heiress, and Captain John (“Mad Jack”) Byron, a fortune-hunting widower with a daughter, Augusta. The profligate captain squandered his wife’s inheritance, was absent for the birth of his only son, and eventually decamped for France as an exile from English creditors, where he died in 1791 at 36.'''

I know the steps I would like the code to take. I want words that differ only in capitalisation to be counted together, so Hi and hi would count as Hi = 2.

I am trying to get the code to loop through the text and create a dictionary showing how many times each word appears. My final goal is to then have the code state which word appears most frequently.

I don't know how to approach such a large amount of text. The examples I have seen are for a much smaller number of words.

I have tried to remove white space and also create a loop but I am stuck and unsure if I am going the right way about coding this problem.

a.replace(" ", "")  
#this gave <built-in method replace of str object at 0x000001A49AD8DAE0>, I have no idea what this means!

print(a.replace) # this is what I tried to write to remove white spaces

I am unsure of how to create the dictionary.

To count the word frequency would I do something like:

frequency = {}
for value in my_dict.values() :
    if value in frequency :
        frequency[value] = frequency[value] + 1
    else :
        frequency[value] = 1

What I was expecting to get was a dictionary that lists each word shown with a numerical value showing how often it appears in the text.

Then I wanted to have the code show the word that occurs the most.

Look into [collections.Counter](https://docs.python.org/3/library/collections.html#collections.Counter). You will also need to do some additional work in your code to remove non-word characters, lowercase alpha characters, etc. — benvc, Jul 09 '19 at 20:06
How do I break up such a large block of text? if it was a simple string I could just manually enter the key and value for each word so ``` statesAndCapitals = { 'Gujarat' : 'Gandhinagar', 'Maharashtra' : 'Mumbai', 'Rajasthan' : 'Jaipur', 'Bihar' : 'Patna' } print('List Of given states and their capitals:\n') # Iterating over values for state, capital in statesAndCapitals.items(): print(state, ":", capital) ``` — Newtocode, Jul 09 '19 at 20:12
https://stackoverflow.com/questions/6181763/converting-a-string-to-a-list-of-words — benvc, Jul 09 '19 at 20:16

score 0 · Answer 1 · answered Jul 09 '19 at 20:11

This may be too simple for your requirements, but you could do this to create a dictionary of each word and its number of repetitions in the text.

text = "..." # text here.
frequency = {}
for word in text.split(" "):
    if word not in frequency:
        frequency[word] = 1
    else:
        frequency[word] += 1

print(frequency)

This only splits the text at each ' ' and counts how many times each resulting piece occurs. If you want to count only the words, you may have to remove ',' and any other characters that you do not want in your dictionary.

To remove a character such as ',', do:

text = text.replace(",", "")
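Removing characters one at a time gets tedious, so here is a rough sketch of one alternative: trim punctuation from the ends of each piece and lowercase it, so that "born," and "Born" end up as the same word. The names `sample`, `trim_chars` and `words` are made up for this illustration, and the sample sentence is only loosely based on the question's text. Note that `string.punctuation` covers ASCII punctuation only, so the curly quotes that appear in the question's text are added by hand.

```python
import string

sample = "He was born, with a clubbed right foot, in London.  He died in 1791."

# split() with no argument splits on any run of whitespace (spaces, newlines, tabs)
raw_tokens = sample.split()

# Characters to trim from both ends of each token; curly quotes are added
# manually because string.punctuation only contains ASCII punctuation.
trim_chars = string.punctuation + "“”‘’"

words = []
for token in raw_tokens:
    cleaned = token.strip(trim_chars).lower()
    if cleaned:  # skip tokens that were nothing but punctuation
        words.append(cleaned)

print(words)
```

Because `strip` only trims the ends, an apostrophe inside a word such as "wife's" is kept. The resulting `words` list can be fed to the counting loop above in place of `text.split(" ")`.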

Hope this helps and happy coding.

I had tried to use text = text.replace(",", "") to remove the characters that were not words but for some reason my code is not doing it....should this be inside the loop? — Newtocode, Jul 09 '19 at 20:24
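It is hard to say what went wrong without seeing the full code, but two things commonly cause this, so here is a small sketch of both. First, `a.replace` without parentheses only names the method, which is why Python printed `built-in method replace of str object` in the question. Second, strings cannot be changed in place: `replace` returns a new string, and the result has to be assigned back. The replacement does not need to be inside the counting loop; it can run once before the text is split. The string `sample` is a made-up example.

```python
sample = "Hi, hi, hello."

# Naming the method without calling it gives a method object, not a new string.
print(sample.replace)

# Calling it returns a NEW string; the original is unchanged unless reassigned.
sample.replace(",", "")
print(sample)  # still contains the commas

# Reassign the result, once per character, before splitting into words.
for char in ",.":
    sample = sample.replace(char, "")
print(sample)  # Hi hi hello
```

Also note that removing only ',' leaves full stops, quotes and brackets in place, so words such as "1788." would still keep their punctuation.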

ofthegoats · Answer 2 · 2019-07-09T20:50:42.583

First, to remove all non-alphabet characters apart from the apostrophe ('), we can use a regex.
After that, we go through the list of words and count them with a dictionary.

import re
d = {}
text = text.split(" ")  # turns it into a list of tokens
text = [re.findall("[a-zA-Z']", token) for token in text]
# each token becomes a list of its letters/apostrophes; everything else is dropped
text = ["".join(chars) for chars in text]
# puts each word back together
# there may be a better way to do the steps above. If so, please tell.
for word in text:
    if word in d:
        d[word] += 1
    else:
        d[word] = 1
d.pop("", None)
# tokens with no letters at all (e.g. "22,") end up as the key "";
# the None default avoids a KeyError when no such key exists
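As for where the "" key comes from, here is a small sketch of what the regex step does to individual tokens. The list `tokens` is a hand-picked, hypothetical sample of the pieces you would get from splitting the question's text on spaces. The lowercasing at the end is an addition of mine, not part of the answer's code, so that "Hi" and "hi" are counted together as the question asks.

```python
import re

# Hypothetical tokens, as produced by splitting the question's text on spaces
tokens = ["January", "22,", "1788.", "wife's", "Byron,"]

# Keep only ASCII letters and the straight apostrophe from each token
cleaned = ["".join(re.findall("[a-zA-Z']", token)) for token in tokens]
print(cleaned)  # ['January', '', '', "wife's", 'Byron']

# Tokens such as "22," contain no letters, so they collapse to "";
# filtering them out here means no "" key is ever created.
words = [word.lower() for word in cleaned if word]
print(words)  # ['january', "wife's", 'byron']
```

One caveat: the pattern matches only the straight apostrophe, so a curly apostrophe like the one in the question's "wife’s" would be dropped, giving "wifes".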

score 0 · Answer 3 · 2019-07-09T21:12:02.067

You can use a regex and Counter from collections:

import re
from collections import Counter

text = "This cat is not a cat, even if it looks like a cat"

# Extract words with regex, ignoring symbols and space
words = re.compile(r"\b\w+\b").findall(text.lower())

count = Counter(words)
# Counter({'cat': 3, 'a': 2, 'this': 1, 'is': 1, 'not': 1, 'even': 1, 'if': 1, 'it': 1, 'looks': 1, 'like': 1})

# To get the most frequent
most_frequent = max(count, key=lambda k: count[k])
# 'cat'
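If you also want the count itself, or the top few words rather than just one, `Counter.most_common` may be convenient. A minimal sketch, using a short made-up word list instead of the regex output:

```python
from collections import Counter

words = ["cat", "a", "cat", "this", "a", "cat"]
count = Counter(words)

# most_common(n) returns a list of (word, count) pairs, highest count first
print(count.most_common(2))  # [('cat', 3), ('a', 2)]

# The single most frequent word together with its count
word, times = count.most_common(1)[0]
print(word, times)  # cat 3
```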

edited Jul 09 '19 at 21:12

answered Jul 09 '19 at 21:06

Is there a way to do this without using regex or using a counter? I am trying to write the code so it is very simple for me to follow as I am very new to coding – Newtocode Jul 09 '19 at 22:04
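One way this could look with nothing but loops, `if` statements and string methods is sketched below. It is only an illustration under some simplifying assumptions: every non-letter character is dropped (so "wife's" would become "wifes" and numbers such as 1788 are ignored), and if two words tie for first place, the one that was added to the dictionary first is reported.

```python
text = "This cat is not a cat, even if it looks like a cat"

frequency = {}
for token in text.lower().split():
    # Keep only the letters of each token, dropping commas, full stops, etc.
    word = ""
    for char in token:
        if char.isalpha():
            word += char
    if word == "":
        continue  # the token had no letters at all, e.g. "22,"
    if word in frequency:
        frequency[word] += 1
    else:
        frequency[word] = 1

# Walk through the dictionary, remembering the highest count seen so far
most_frequent = None
highest = 0
for word, times in frequency.items():
    if times > highest:
        most_frequent = word
        highest = times

print(frequency)
print(most_frequent, highest)  # cat 3
```

The size of the text does not change the code: the same loop works for one sentence or for a whole book, because `split()` does the job of breaking the text into pieces for you.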
