Count every word in a text, rank the words by frequency, strip unwanted symbols, and print each word with its count

Question

How can I count all the words in my Wikipedia output, rank them to find the 10 most frequent words, and print those words without any symbols?

import wikipedia

wikipedia.set_lang("en")
a = wikipedia.page("bitcoin")
words = a.content

print(words)

[How to remove punctuation](https://stackoverflow.com/a/60725620/2308683) + [Find most common elements with count](https://stackoverflow.com/a/27303678/2308683) — OneCricketeer, Jun 21 '22 at 21:39

Felipe Mezzarana · Answer 1 · 2022-06-21T21:56:27.327

Since the variable `words` is a string, you can use the nltk library to split it into a list of words and then perform your tasks on that list. Something like this:

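import nltk
from nltk.probability import FreqDist

# word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
words_list = nltk.word_tokenize(words)       # split the text into tokens
words_frequency = FreqDist(words_list)       # token -> number of occurrences
words_count = len(words_list)                # total number of tokens
words_unique_count = len(set(words_list))    # number of distinct tokens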
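The snippet above builds the frequency table but stops short of the "top 10" step the question asks for. `FreqDist` is a subclass of `collections.Counter`, so calling `most_common(10)` on it returns the ten most frequent tokens with their counts. The sketch below shows the same idea using only the standard library, so it runs without NLTK; the helper name `top_words` and the sample sentence are made up for illustration.

```python
from collections import Counter

def top_words(tokens, n=10):
    """Return the n most frequent tokens as (token, count) pairs."""
    # Counter tallies the tokens; most_common sorts by count, highest first.
    return Counter(tokens).most_common(n)

# Hypothetical sample input; in the question this would be the tokenized page.
sample_tokens = "the cat and the dog and the bird".split()

for word, count in top_words(sample_tokens, n=2):
    print(word, count)
```

With the real data you would pass the token list built from the Wikipedia text and leave `n` at its default of 10.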

Now, to remove undesired words or symbols, you need to apply a function to your string before tokenizing it. Try this:

import re

def normalize(string):
    # Add '|your symbol' to the pattern to remove more symbols
    clean_string = re.sub(r'Ø|\+', '', string)
    return clean_string
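Listing symbols one by one gets tedious for a whole Wikipedia page. One alternative, which is a swapped-in technique and not what the answer above does, is to keep only the characters you want instead of removing the ones you do not. The sketch below assumes that a "word" is a run of ASCII letters or digits and that counting should be case-insensitive; the function name `clean_and_count` and the sample text are hypothetical.

```python
import re
from collections import Counter

def clean_and_count(text, n=10):
    """Lowercase the text, drop every symbol, and return the n most common words."""
    # Assumption: a word is a run of ASCII letters/digits; everything else
    # (punctuation, '==' heading markers, accented letters) is discarded.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words).most_common(n)

# Hypothetical sample; in the question this would be a.content
sample = "Bitcoin (BTC) is a currency. Bitcoin is decentralized; bitcoin rocks!"

for word, count in clean_and_count(sample, n=2):
    print(f"{word}: {count}")
```

If the text contains non-English words, the character class would need widening, since `[a-z0-9]` splits words at accented letters.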
