
How can I count all the words in my Wikipedia output, rank them to get the 10 most frequent words, and print those words without any symbols?

import wikipedia

wikipedia.set_lang("en")
a = wikipedia.page("bitcoin")
words = a.content

print(words)
macropod
    [How to remove punctuation](https://stackoverflow.com/a/60725620/2308683) + [Find most common elements with count](https://stackoverflow.com/a/27303678/2308683) – OneCricketeer Jun 21 '22 at 21:39

1 Answer


Since the variable words is a string, you can use the nltk library to split it into a list of words and then perform your tasks on that list. Something like this:

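import nltk
from nltk.probability import FreqDist

# word_tokenize needs the Punkt tokenizer data; download it once with nltk.download() if it is missing
words_list = nltk.word_tokenize(words)
words_frequency = FreqDist(words_list)      # maps each token to its count
words_count = len(words_list)               # total number of tokens
words_unique_count = len(set(words_list))   # number of distinct tokens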
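
To get the 10 most frequent words that the question asks for, the frequency table can be ranked with most_common. FreqDist is a subclass of collections.Counter, so the same call should work on the FreqDist object above. The sketch below uses Counter directly so that it runs without nltk; the helper name top_words and the sample token list are made up for illustration.

```python
from collections import Counter

def top_words(tokens, n=10):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)

# Hypothetical sample tokens standing in for the tokenized article text
sample_tokens = ["bitcoin", "is", "a", "currency", "bitcoin", "is", "bitcoin"]

for word, count in top_words(sample_tokens, n=2):
    print(word, count)
```

With the real article you would pass the tokenized word list and leave n at its default of 10.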

Now, to remove undesired words or symbols, you will need to apply a function to your string. Try this:

import re

def normalize(string):
    # Remove the listed symbols; add '|your symbol' to the pattern to remove more
    clean_string = re.sub(r'Ø|\+', '', string)

    return clean_string
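
Listing every symbol by hand gets tedious for a full article. As an alternative, here is a minimal sketch that replaces every character that is not a word character or whitespace with a space before splitting. The name strip_symbols and the sample sentence are assumptions for illustration, not part of the code above.

```python
import re

def strip_symbols(text):
    """Lowercase text and replace every non-word, non-whitespace character with a space."""
    return re.sub(r"[^\w\s]", " ", text.lower())

# Hypothetical sample standing in for the Wikipedia page content
sample = "Bitcoin (BTC) is a currency; Bitcoin is decentralized!"
tokens = strip_symbols(sample).split()
print(tokens)
```

Lowercasing makes "Bitcoin" and "bitcoin" count as the same word. Note that this pattern keeps digits and underscores, and it splits contractions such as "don't" into two tokens, so adjust it if that matters for your counts.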