
I am hoping to get a count of how often a specific word shows on a given URL. I currently have a way to do this for a small set of URLs and a single word:

import requests
from bs4 import BeautifulSoup

url_list = ["https://www.example.org/","https://www.example.com/"]

#the_word = input()
the_word = 'Python'

total_words = []
for url in url_list:
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    words = soup.find_all(text=lambda text: text and the_word.lower() in text)
    count = len(words)
    words_list = [ ele.strip() for ele in words ]
    for word in words:
        total_words.append(word.strip())

    print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
    print(words_list)


#print(total_words)
total_count = len(total_words)

However, my hope is to be able to do this for a set of words mapped to their respective URLs, as shown in the data frame below.

Target Word    Target URL
word1          www.example.com/topic-1/
word2          www.example.com/topic-2/

The output would ideally give me a new column with a count of how often the word shows on its associated URL. For example, how often 'word1' shows on 'www.example.com/topic-1/'.
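
To make the goal concrete, here is a sketch of the shape I'm after (count_word_on_url is just a placeholder name for the per-row counting logic, and the URLs are the example ones from the table above):

import pandas as pd

df = pd.DataFrame({
    'Target Word': ['word1', 'word2'],
    'Target URL': ['www.example.com/topic-1/', 'www.example.com/topic-2/'],
})

# placeholder for the counting logic from my single-word code above
def count_word_on_url(word, url):
    return 0

# desired result: a new column with the per-URL count for each word
df['Count'] = df.apply(
    lambda row: count_word_on_url(row['Target Word'], row['Target URL']),
    axis=1,
)
print(df)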

Any and all help is much appreciated!

Alex Fuss

2 Answers


Just iterate over your structure - a dict, a list of dicts, ... The following example only points in a direction, because your question is not entirely clear and is missing an exact expected result. I am sure you can adapt it to your specific needs.

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = [
    {'word':'Python','url':'https://stackoverflow.com/questions/tagged/python'},
    {'word':'Question','url':'https://stackoverflow.com/questions/tagged/python'}
]

for item in data:
    r = requests.get(item['url'], allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    count = soup.body.get_text(strip=True).lower().count(item['word'].lower())
    item['count'] = count

pd.DataFrame(data)

Output (a DataFrame with one row per word/url pair and a new count column)
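
If your word/url pairs start out in a CSV (as mentioned in the comments below), one way to build the same list-of-dicts structure could be the following - just a sketch, assuming the CSV has columns named word and url:

import pandas as pd

# assuming a CSV with columns 'word' and 'url' - adjust names/path to your file
data = pd.read_csv('words.csv').to_dict('records')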

NOTE: Depending on what you want to base the word frequency on, you should consider the following:

  1. Human-readable text should be extracted separately from the HTML, e.g. with BeautifulSoup.
  2. The tool has to be chosen depending on how the content of the web page is provided (static / dynamic). For dynamic content, for example, selenium is to be preferred, because unlike requests it also renders JavaScript (see the sketch below).
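
A minimal sketch of the dynamic case might look like this (assuming a matching Chrome driver is available; the counting itself stays the same):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.example.com/topic-1/')

# page_source holds the DOM after JavaScript has been executed
soup = BeautifulSoup(driver.page_source.lower(), 'lxml')
count = soup.body.get_text(strip=True).count('word1')

driver.quit()
print(count)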
HedgeHog
  • Thanks! This is helpful. The output you've shown is exactly what I am looking for. I am using a csv that I have turned into a list of dictionaries similar to what you have above. However, I am having trouble looping over the list of dicts to get the same output. Thoughts? – Alex Fuss Jan 04 '22 at 14:52
  • Thanks again! Updated question here https://stackoverflow.com/questions/70581444/word-frequency-by-iterating-over-a-list-of-dictionaries-python – Alex Fuss Jan 04 '22 at 15:49
  • @AlexFuss Saw you found your answer, great - Also added a note with some context, to my answer, in case you are dealing with dynamically served content. – HedgeHog Jan 04 '22 at 18:28

You should try the count() method on the page text. With your code, it will look like this:

count = r.text.lower().count(the_word.lower())
print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
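
Dropped into the loop from your question, that might look like this (just a sketch, using the same example urls and word):

import requests

url_list = ["https://www.example.org/", "https://www.example.com/"]
the_word = 'Python'

for url in url_list:
    r = requests.get(url, allow_redirects=False)
    # count substring occurrences of the word in the raw page text
    count = r.text.lower().count(the_word.lower())
    print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))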
Draugael