
I'd like to count how often each word in a list appears on a specific web page. However, the code doesn't return the same counts that a manual Ctrl+F search in the browser does. What am I doing wrong?

Here's my code:

import pandas as pd
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import re

url='https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr=[] 
wanted = ['tender','2020','date']    
for word in wanted:
    a=requests.get(url).text.count(word)
    dic={'phrase':word,
          'frequency':a,              
            }          
    fr.append(dic)  
    print('Frequency of',word, 'is:',a)
data=pd.DataFrame(fr)    
  • Read [this article](https://ericlippert.com/2014/03/05/how-to-debug-small-programs/) for tips about debugging your code. – Code-Apprentice Apr 27 '21 at 22:49
  • One thing to be aware of: `requests` might not give you the exact same text as you see in your browser. This can happen, for example, if the web page has JavaScript code that modifies the contents of the page. Your browser executes that code, but `requests` will not. On the other hand, `selenium` will give you exactly the same thing as you see in your browser. If you know there is JavaScript code, then you should use `selenium` instead of `requests`. – Code-Apprentice Apr 27 '21 at 22:51
  • Please supply the expected [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) (MRE). We should be able to copy and paste a contiguous block of your code, execute that file, and reproduce your problem along with tracing output for the problem points. This lets us test our suggestions against your test data and desired output. – Prune Apr 27 '21 at 22:53
  • In particular, what are the specific discrepancies? Which is the correct value, and why? What do the interface documents say about their operation? Perhaps they have different definitions of counting a given word, such that a difference is actually the correct response. – Prune Apr 27 '21 at 22:53

2 Answers


When I tried your code on the word "Tender", a=requests.get(url).text.count(word) returned many more matches than Ctrl+F. That was surprising, because I expected it to return fewer (text.count is case-sensitive, HTML sometimes breaks elements across multiple lines, and so on). But if you print requests.get(url).text and read through it, you'll notice that it contains elements that aren't displayed on the page, and that plenty of occurrences of "Tender" sit between tags. I'd advise you to use BeautifulSoup or find some other way to avoid counting the invisible text.

One more small thing: you can store requests.get(url).text in a variable outside the loop, so you don't send a new request on every iteration.
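
A minimal sketch of that last point, assuming a hypothetical helper named count_words: the page text is obtained once, lowercased once, and every word is then counted against the same string. The sample string stands in for requests.get(url).text so the snippet runs without a network connection.

```python
def count_words(text, wanted):
    """Count case-insensitive substring occurrences of each word in text."""
    lowered = text.lower()  # lowercase once, outside the loop
    return [{'phrase': word, 'frequency': lowered.count(word.lower())}
            for word in wanted]


# Stand-in for `requests.get(url).text`, fetched a single time before the loop.
page_text = 'Tender documents: the tender closing date is in 2020. Update: new date.'

for row in count_words(page_text, ['tender', '2020', 'date']):
    print('Frequency of', row['phrase'], 'is:', row['frequency'])
```

Note that str.count matches substrings, so "date" is also found inside "Update", which is the same behaviour as a plain Ctrl+F search.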

Khoa Nguyen

Refer to the comments in your question to see why using requests might be a bad idea to count the frequency of a word in the "visible spectrum" of a webpage (what you actually see in the browser).

If you want to go about this with selenium, you could try:

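from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'

# Assumes Selenium 4.6 or newer, which locates a matching chromedriver on its own;
# older versions need the path to the chromedriver executable passed in explicitly.
driver = webdriver.Chrome()
driver.get(url)
# Rendered text of the page, lowercased once because str.count is case-sensitive
body_text = driver.find_element(By.TAG_NAME, 'body').text.lower()
driver.quit()  # close the browser once the text has been read

fr = []
wanted = ['tender', '2020', 'date']
for word in wanted:
    freq = body_text.count(word)
    dic = {'phrase': word, 'frequency': freq}
    fr.append(dic)
    print('Frequency of', word, 'is:', freq)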

which gave me the same results that a CTRL + F does.

You can test BeautifulSoup too (which you're importing by the way) by modifying your code a little bit:

import requests
from bs4 import BeautifulSoup

url = 'https://www.gov.uk/government/publications/specialist-quality-mark-tender-2016'
fr = []
wanted = ['tender', '2020', 'date']
html = requests.get(url).text  # fetch the page once, outside the loop
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text().lower()  # extract and lowercase the text once
for word in wanted:
    freq = text.count(word)
    dic = {'phrase': word, 'frequency': freq}
    fr.append(dic)
    print('Frequency of', word, 'is:', freq)

That gave me the same results, except for the word tender, which according to BeautifulSoup appears 12 times, and not 11. Test them out for yourself and see what suits you.
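
One possible explanation for the extra match (an assumption, not verified against this page) is that get_text() returns every text node in the document, including the contents of title, script and style elements, which the browser does not render in the page body and Ctrl+F does not search. The sketch below illustrates the idea with the standard library's html.parser instead of BeautifulSoup, so it runs without extra packages. The class name VisibleTextParser, the helper visible_text and the sample HTML are hypothetical, and elements hidden through CSS are not handled.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping elements the browser does not render."""

    SKIPPED = {'script', 'style', 'title'}

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []  # text nodes kept so far

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the text of html without title, script, or style contents."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)


# Hypothetical sample page: "tender" appears in the title, a script,
# an href attribute, and once in the text a reader would actually see.
sample = ('<html><head><title>Tender 2016</title>'
          '<script>var t = "tender";</script></head>'
          '<body><a href="/tender">Tender</a> closing date</body></html>')

print('raw HTML count:', sample.lower().count('tender'))
print('visible text count:', visible_text(sample).lower().count('tender'))
```

Counting on the raw HTML also picks up matches inside attributes such as href, which is why the raw count is the highest of the three approaches.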

Camilo Martinez M.
  • This is excellent! thank you so much for your insight Camilo !! – Fatima El Mansouri Apr 28 '21 at 04:41
  • Selenium worked perfectly for me! This is however only a snippet from a code which loops through a dataframe containing URLs and counts specific keywords for each URL. I have 20+ URLs in the DF, is there a way to not have that many windows open while looping through the URLs with Selenium? Thank you again for your great answer! – Fatima El Mansouri Apr 28 '21 at 05:00
  • I'm glad it helped you. Regarding the opened browser windows, I am not sure. I haven't tried this, but a quick search led me here (the second answer, not the accepted one): https://stackoverflow.com/questions/7593611/selenium-testing-without-browser. That should get you going. – Camilo Martinez M. Apr 28 '21 at 07:58
  • Thank you again, I appreciate you taking the time to answer :) !! This worked for me as well! Have a great day ! – Fatima El Mansouri Apr 29 '21 at 09:34