I've used spaCy to tokenize the text.
First install spaCy and the spaCy model we will use:
pip install spacy
python -m spacy download en_core_web_sm
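(The script below also imports requests and bs4 and uses the lxml parser, so pip install requests beautifulsoup4 lxml if you don't already have those.) If you want a quick sanity check that the model downloaded properly, a minimal snippet like this should print a few tokens without raising an OSError:
import spacy

# Sanity check: spacy.load raises OSError if en_core_web_sm isn't installed
nlp = spacy.load("en_core_web_sm")
print([token.text for token in nlp("Animal Farm, Chapter 1")])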
It's quite straightforward. We get the web page, concatenate all the text within the <p>
elements (ignoring the header and footer), let spaCy do its thang, then remove the non-word tokens and finally hand the remaining words to Counter to count them.
The word counts are in counts. Look at all the print calls to see how to access counts.
import requests
import bs4
import spacy
from collections import Counter
url = "https://www.marxists.org/subject/art/literature/children/texts/orwell/animal-farm/ch01.htm"
page_content = requests.get(url).content
soup = bs4.BeautifulSoup(page_content, "lxml")
text = ""
for paragraph in soup.find_all("p"):
    # We probably don't want text within the header and footer paragraphs
    if paragraph.attrs.get("class", (None,))[0] in ("title", "footer"):
        continue
    text += paragraph.get_text().lower()  # It's best to keep things in one case
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# Not all tokens are words, so we exclude some
words = tuple(token.text for token in doc if not (token.is_punct or token.is_space or
                                                  token.is_quote or token.is_bracket))
counts = Counter(words)
print("Word count:", len(words)) # Or sum(counts.values())
print("Unique word count:", len(counts))
print("15 most common words:")
for i, (word, count) in enumerate(counts.most_common(15), start=1):
    print(f"{i: >2}. {count: >3} - {word}")
print("The word 'animal' occurs:", counts["animal"])
print("The word 'python' occurs:", counts["python"])
print("All words and their count:")
for word, count in counts.items():
    print(f"{count}, {word}")
Output:
Word count: 2704
Unique word count: 849
15 most common words:
 1. 169 - the
 2.  98 - and
 3.  93 - of
 4.  59 - to
 5.  51 - a
 6.  44 - in
 7.  44 - that
 8.  42 - it
 9.  34 - i
10.  34 - is
11.  33 - was
12.  31 - had
13.  31 - he
14.  27 - you
15.  24 - all
The word 'animal' occurs: 11
The word 'python' occurs: 0
All words and their count:
4, mr
8, jones
93, of
[...]
1, birds
1, jumped
1, perches
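Unsurprisingly, the most common words are mostly stop words ("the", "and", "of", ...). If you'd rather count only the more meaningful words, spaCy tokens also have an is_stop attribute you can filter on. A rough sketch, reusing doc from above (content_words and content_counts are just names I made up):
# Also exclude stop words ("the", "of", "and", ...) via spaCy's is_stop flag
content_words = tuple(token.text for token in doc
                      if not (token.is_punct or token.is_space or token.is_quote or
                              token.is_bracket or token.is_stop))
content_counts = Counter(content_words)
print("15 most common non-stop words:")
for i, (word, count) in enumerate(content_counts.most_common(15), start=1):
    print(f"{i: >2}. {count: >3} - {word}")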