I want to find how many times a particular word occurs in the HTML text of a web page, using Beautiful Soup.
I tried the `findAll` function, but it only finds words within a particular tag: `soup.body.findAll` finds the word within the body tag, whereas I want to search for the word within all tags in the HTML text.
Also, once I find the word, I need to create a list of the words that come just before and just after it. Can someone please help me do that? Thanks.

Ritave

Kanika Singh
- Possible duplicate of [Using BeautifulSoup to search html for string](http://stackoverflow.com/questions/8936030/using-beautifulsoup-to-search-html-for-string) – Ritave Oct 28 '15 at 16:52
- No, it's not a duplicate, I checked – Kanika Singh Oct 28 '15 at 17:04
1 Answer
According to the Beautiful Soup 4 API, `find_all` takes a `recursive` keyword (it defaults to `True`), so a search started at `soup.body` covers the whole tree below it, not only the direct children. The call returns the matching strings, which you can then split into words and work on.
Here is a complete example:
    import re

    import bs4

    data = '''
    <html>
    <body>
    <div>today is a sunny day</div>
    <div>I love when it's sunny outside</div>
    Call me sunny
    <div>sunny is a cool word sunny</div>
    </body>
    </html>
    '''
    searched_word = 'sunny'
    soup = bs4.BeautifulSoup(data, 'html.parser')
    # Match every text node containing the word, at any depth below <body>.
    # The string= keyword needs Beautiful Soup 4.4+; use text= on 4.3 and earlier.
    results = soup.body.find_all(string=re.compile(re.escape(searched_word)), recursive=True)
    # Note: len(results) counts matching text nodes, not individual occurrences
    print('Found the word "{0}" {1} times\n'.format(searched_word, len(results)))
    for content in results:
        words = content.split()
        for index, word in enumerate(words):
            # If the content contains the search word twice or more, this fires for each occurrence
            if word == searched_word:
                print('Whole content: "{0}"'.format(content))
                before = None
                after = None
                # Only look back if this is not the first word
                if index != 0:
                    before = words[index - 1]
                # Only look ahead if this is not the last word
                if index != len(words) - 1:
                    after = words[index + 1]
                print('\tWord before: "{0}", word after: "{1}"'.format(before, after))
It outputs:

    Found the word "sunny" 4 times

    Whole content: "today is a sunny day"
        Word before: "a", word after: "day"
    Whole content: "I love when it's sunny outside"
        Word before: "it's", word after: "outside"
    Whole content: "
    Call me sunny
    "
        Word before: "me", word after: "None"
    Whole content: "sunny is a cool word sunny"
        Word before: "None", word after: "is"
    Whole content: "sunny is a cool word sunny"
        Word before: "word", word after: "None"
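
Note that the count of 4 is the number of text nodes containing the word, while the word itself appears 5 times (twice in the last `<div>`). If the total number of occurrences is what you need, one option is to count over the page's combined text instead. This is only a minimal sketch: `count_word` is a hypothetical helper name (not part of Beautiful Soup), and it assumes whole-word, case-insensitive matching is what you want.

```python
import re

import bs4


def count_word(html, word):
    """Count whole-word, case-insensitive occurrences of `word` in the page text."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    # get_text() joins the text nodes of the document, whatever tag they sit in;
    # the ' ' separator keeps words from adjacent tags from running together
    text = soup.get_text(' ')
    pattern = re.compile(r'\b{0}\b'.format(re.escape(word)), re.IGNORECASE)
    return len(pattern.findall(text))


html = '''
<html>
<body>
<div>today is a sunny day</div>
<div>I love when it's sunny outside</div>
Call me sunny
<div>sunny is a cool word sunny</div>
</body>
</html>
'''
print(count_word(html, 'sunny'))
```

The `\b` word boundaries keep longer words such as "sunnyside" from being counted; drop them if substring matches should count too.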

Ritave
- `results = soup.body.find_all(string=searched_word, recursive=true)` gives `NameError: name 'true' is not defined` – Kanika Singh Oct 28 '15 at 17:35
- I have downloaded version 4.3 from http://www.crummy.com/software/BeautifulSoup/bs4/download/ – Kanika Singh Oct 28 '15 at 17:36
- I updated the answer with a complete, working example, please check it again – Ritave Oct 28 '15 at 17:46
- I am getting `Found the word "sunny" 0 times`. Are you using Python 2.7.3? I just copy-pasted your example code – Kanika Singh Oct 28 '15 at 18:03
- It seems the `string` keyword was added in version 4.4, so use that version or change `soup.body.find_all(string=...)` to `soup.body.find_all(text=...)` (different keyword for 4.3 and before) – Ritave Oct 28 '15 at 18:05
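
If you would rather not depend on which keyword your installed version accepts, a possible alternative is to walk the text nodes directly. The sketch below assumes the `.strings` generator is available in your release (as far as I know it is part of Beautiful Soup 4); `strings_containing` is a hypothetical helper name, and it does plain substring matching rather than a regex search.

```python
import bs4


def strings_containing(soup, word):
    """Return every text node under `soup` that contains `word` as a substring."""
    # .strings yields all text nodes in the tree, so neither string= nor text= is needed
    return [s for s in soup.strings if word in s]


soup = bs4.BeautifulSoup('<body><div>a sunny day</div><p>rain</p>sunny</body>', 'html.parser')
matches = strings_containing(soup, 'sunny')
print(len(matches))
```

Each match behaves like a normal string, so it can be passed to `.split()` exactly as in the answer's loop.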
- Thanks a lot, the issue was with the version. Now I can append the words that come before and after the required word! Thank you once again. I was so troubled, and you solved it so easily :) – Kanika Singh Oct 28 '15 at 18:30
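
For the "append the words before and after" step, here is one way it could look. It is a sketch under assumptions: `neighbours` is a hypothetical helper, it takes the matched strings as plain text, and it tokenizes on word characters and apostrophes so that trailing punctuation ("sunny,") does not hide a match, which differs slightly from the plain `split()` used in the answer.

```python
import re


def neighbours(texts, word):
    """Collect the words found directly before and directly after `word`."""
    before, after = [], []
    for text in texts:
        # Keep apostrophes so "it's" stays one token; punctuation is dropped
        words = re.findall(r"[\w']+", text)
        for index, current in enumerate(words):
            if current.lower() != word.lower():
                continue
            if index > 0:
                before.append(words[index - 1])
            if index < len(words) - 1:
                after.append(words[index + 1])
    return before, after


texts = [
    'today is a sunny day',
    "I love when it's sunny outside",
    '\nCall me sunny\n',
    'sunny is a cool word sunny',
]
before_words, after_words = neighbours(texts, 'sunny')
print(before_words)
print(after_words)
```

The two lists are not aligned index by index, because an occurrence at the start or end of a string contributes to only one of them; store pairs instead if you need to keep them matched up.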