
I am hoping to get a count of how often a specific word shows on a given URL. I currently have a way to do this for a small set of URLs and a single word:

import requests
from bs4 import BeautifulSoup

url_list = ["https://www.example.org/","https://www.example.com/"]

#the_word = input()
the_word = 'Python'

total_words = []
for url in url_list:
    r = requests.get(url, allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    words = soup.find_all(text=lambda text: text and the_word.lower() in text)
    count = len(words)
    words_list = [ ele.strip() for ele in words ]
    for word in words:
        total_words.append(word.strip())

    print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
    print(words_list)


#print(total_words)
total_count = len(total_words)

However, my hope is to be able to do this for a set of words mapped to their respective URLs, as shown in the data frame below.

Target Word    Target URL
word1          www.example.com/topic-1/
word2          www.example.com/topic-2/

The output would ideally give me a new column with a count of how often the word shows on its associated URL. For example, how often 'word1' shows on 'www.example.com/topic-1/'.
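
To make the goal concrete, here is a sketch of the shape I'm after (count_word_on_url is just a placeholder name for the per-row counting logic, and the URLs are the example ones from the table above):

import pandas as pd

df = pd.DataFrame({
    'Target Word': ['word1', 'word2'],
    'Target URL': ['www.example.com/topic-1/', 'www.example.com/topic-2/'],
})

# placeholder for the counting logic from my single-word code above
def count_word_on_url(word, url):
    return 0

# desired result: a new column with the per-URL count for each word
df['Count'] = df.apply(
    lambda row: count_word_on_url(row['Target Word'], row['Target URL']),
    axis=1,
)
print(df)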

Any and all help is much appreciated!

Alex Fuss

2 Answers


Just iterate over your structure - a dict, a list of dicts, ... The following example only points in a direction, because your question is not entirely clear and is missing an exact expected result. I am sure you can adapt it to your specific needs.

Example

import requests
from bs4 import BeautifulSoup
import pandas as pd

data = [
    {'word':'Python','url':'https://stackoverflow.com/questions/tagged/python'},
    {'word':'Question','url':'https://stackoverflow.com/questions/tagged/python'}
]

for item in data:
    r = requests.get(item['url'], allow_redirects=False)
    soup = BeautifulSoup(r.content.lower(), 'lxml')
    count = soup.body.get_text(strip=True).lower().count(item['word'].lower())
    item['count'] = count

pd.DataFrame(data)

Output (a DataFrame with one row per word/url pair and a new count column)
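
If your word/url pairs start out in a CSV (as mentioned in the comments below), one way to build the same list-of-dicts structure could be the following - just a sketch, assuming the CSV has columns named word and url:

import pandas as pd

# assuming a CSV with columns 'word' and 'url' - adjust names/path to your file
data = pd.read_csv('words.csv').to_dict('records')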

NOTE: Depending on what you want to base the word frequency on, you should consider the following:

  1. Human-readable text should be extracted separately from the HTML, e.g. with BeautifulSoup.
  2. The tool has to be chosen depending on how the content of the web page is provided (static / dynamic). For dynamic content, for example, selenium is to be preferred, because unlike requests it also renders JavaScript (see the sketch below).
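
A minimal sketch of the dynamic case might look like this (assuming a matching Chrome driver is available; the counting itself stays the same):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.example.com/topic-1/')

# page_source holds the DOM after JavaScript has been executed
soup = BeautifulSoup(driver.page_source.lower(), 'lxml')
count = soup.body.get_text(strip=True).count('word1')

driver.quit()
print(count)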
HedgeHog
  • Thanks! This is helpful. The output you've shown is exactly what I am looking for. I am using a csv that I have turned into a list of dictionaries similar to what you have above. However, I am having trouble looping over the list of dicts to get the same output. Thoughts? – Alex Fuss Jan 04 '22 at 14:52
  • Thanks again! Updated question here https://stackoverflow.com/questions/70581444/word-frequency-by-iterating-over-a-list-of-dictionaries-python – Alex Fuss Jan 04 '22 at 15:49
  • @AlexFuss Saw you found your answer, great - Also added a note with some context, to my answer, in case you are dealing with dynamically served content. – HedgeHog Jan 04 '22 at 18:28

You should try the count() method on the page text. With your code, it will look like this:

count = r.text.lower().count(the_word.lower())
print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
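
Dropped into the loop from your question, that might look like this (just a sketch, using the same example urls and word):

import requests

url_list = ["https://www.example.org/", "https://www.example.com/"]
the_word = 'Python'

for url in url_list:
    r = requests.get(url, allow_redirects=False)
    # count substring occurrences of the word in the raw page text
    count = r.text.lower().count(the_word.lower())
    print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))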
Draugael