
I need to extract a list of words without any duplicates, so that I can count the number of occurrences of each word.

import nltk
import lxml
import bs4
import requests
from nltk.tokenize import word_tokenize, sent_tokenize
wSite="https://www.marxists.org/subject/art/literature/children/texts/orwell/animal-farm/ch01.htm"
page=requests.get(wSite).content
soup = bs4.BeautifulSoup(page, "lxml")
z=soup.find_all("p")

container=""
for i in z:
    txt=i.text

    if len(txt) > 1 and txt[1] == '"':  # guard against paragraphs shorter than two characters
        container=container+txt
y=container
a=[]
a=y.split()
b=str(a)
Andrea C
    Possible duplicate of [Remove duplicates from list python](https://stackoverflow.com/questions/28802318/remove-duplicates-from-list-python) – Mike Jun 20 '19 at 16:03

1 Answer

I've used spaCy to tokenize the text.

First install spaCy and the spaCy model we will use:

pip install spacy
python -m spacy download en_core_web_sm

It's quite straightforward. We get the web page, concatenate all the text within the <p> elements (ignoring the header and footer), let spaCy do its thang, then remove the non-word tokens before finally giving it to Counter to count the words.

The word counts end up in the variable counts, which is a collections.Counter. The print calls at the end of the script show the different ways to read from it.

import requests
import bs4
import spacy
from collections import Counter

url = "https://www.marxists.org/subject/art/literature/children/texts/orwell/animal-farm/ch01.htm"

page_content = requests.get(url).content
soup = bs4.BeautifulSoup(page_content, "lxml")
text = ""
for paragraph in soup.find_all("p"):
    # We probably don't want text within the header and footer paragraphs
    if paragraph.attrs.get("class", (None,))[0] in ("title", "footer"):
        continue
    text += paragraph.get_text().lower() # It's best to keep things in one case

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
# Not all tokens are words, so we exclude some
words = tuple(token.text for token in doc if not (token.is_punct or token.is_space or
                                                 token.is_quote or token.is_bracket))
counts = Counter(words)

print("Word count:", len(words)) # Or sum(counts.values())
print("Unique word count:", len(counts))
print("15 most common words:")
for i, (word, count) in enumerate(counts.most_common(15), start=1):
    print(f"{i: >2}. {count: >3} - {word}")

print("The word 'animal' occurs:", counts["animal"])
print("The word 'python' occurs:", counts["python"])
print("All words and their count:")
for word, count in counts.items():
    print(f"{count}, {word}")

Output:

Word count: 2704
Unique word count: 849
15 most common words:
 1. 169 - the
 2.  98 - and
 3.  93 - of
 4.  59 - to
 5.  51 - a
 6.  44 - in
 7.  44 - that
 8.  42 - it
 9.  34 - i
10.  34 - is
11.  33 - was
12.  31 - had
13.  31 - he
14.  27 - you
15.  24 - all
The word 'animal' occurs: 11
The word 'python' occurs: 0
All words and their count:
4, mr
8, jones
93, of
[...]
1, birds
1, jumped
1, perches
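
If installing spaCy is not an option, the same idea can be roughly approximated with only the standard library. This is a minimal sketch, not the method used above: it swaps spaCy's tokenizer for a simple regular expression, so its counts may differ from the output shown (for example around possessives and hyphenated words). The function name count_words and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text, split it into words, and count each word."""
    # Assumption: a "word" is a run of ASCII letters, optionally joined
    # by internal apostrophes (so "jones's" stays one word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return Counter(words)


# Hypothetical sample text, standing in for the scraped chapter
sample = "The farm was the farm. Mr. Jones's farm!"
counts = count_words(sample)

# The keys of a Counter are unique and keep first-seen order (Python 3.7+),
# so this is the list of words without duplicates.
unique_words = list(counts)

print(unique_words)
print(counts["farm"])
```

To use it on the scraped page, pass the text variable built in the script above to count_words instead of the sample string.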
GordonAitchJay