I want to count how many times a particular word occurs in a web page using Beautiful Soup. I tried the findAll function, but it only finds the word within a particular tag; for example, soup.body.findAll finds the word only within the body tag, whereas I want to search every tag in the HTML text. Once I find the word, I also need to build a list of the words that come just before and just after it. Can someone please help me do this? Thanks.

Ritave
Kanika Singh

1 Answer

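With the BeautifulSoup 4 API, you can pass the recursive keyword to find_all to search for the text in the whole tree. The result is a list of strings, which you can then split into words and process. Note that the string keyword used below requires version 4.4 or later; on 4.3 and earlier, use text instead.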

Here is a complete example:

import bs4
import re

data = '''
<html>
<body>
<div>today is a sunny day</div>
<div>I love when it's sunny outside</div>
Call me sunny
<div>sunny is a cool word sunny</div>
</body>
</html>
'''

searched_word = 'sunny'

soup = bs4.BeautifulSoup(data, 'html.parser')
# Collect every text node under <body> that contains the searched word.
# On BeautifulSoup 4.3 and earlier, pass text=... instead of string=...
results = soup.body.find_all(string=re.compile(re.escape(searched_word)), recursive=True)

# len(results) counts the matching text nodes, not the individual occurrences
print('Found the word "{0}" {1} times\n'.format(searched_word, len(results)))

for content in results:
    words = content.split()
    for index, word in enumerate(words):
        # If the content contains the search word twice or more, this fires for each occurrence
        if word == searched_word:
            print('Whole content: "{0}"'.format(content))
            before = None
            after = None
            # Check that it is not the first word
            if index != 0:
                before = words[index-1]
            # Check that it is not the last word
            if index != len(words)-1:
                after = words[index+1]
            print('\tWord before: "{0}", word after: "{1}"'.format(before, after))

It outputs:

Found the word "sunny" 4 times

Whole content: "today is a sunny day"
    Word before: "a", word after: "day"
Whole content: "I love when it's sunny outside"
    Word before: "it's", word after: "outside"
Whole content: "
Call me sunny
"
    Word before: "me", word after: "None"
Whole content: "sunny is a cool word sunny"
    Word before: "None", word after: "is"
Whole content: "sunny is a cool word sunny"
    Word before: "word", word after: "None"

Also see the documentation for the string keyword.
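
If you want the total number of occurrences and two plain lists of neighbouring words, one possible approach is to work on the page text as a single string. This is only a sketch under assumptions: the helper name word_neighbours is hypothetical, the text is assumed to come from something like soup.get_text(), and matching is made case-insensitive and punctuation-tolerant, which differs from the exact comparison used in the example above.

```python
import re

def word_neighbours(text, word):
    """Return (count, before, after) for whole-word matches of `word` in `text`.

    Hypothetical helper: `text` is assumed to be the plain text of the page,
    for instance the result of soup.get_text().
    """
    # Tokenise on runs of word characters and apostrophes, so "sunny," still matches
    tokens = re.findall(r"[\w']+", text)
    target = word.lower()
    count = 0
    before, after = [], []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        count += 1
        # Only record a neighbour when one exists
        if i > 0:
            before.append(tokens[i - 1])
        if i < len(tokens) - 1:
            after.append(tokens[i + 1])
    return count, before, after

count, before, after = word_neighbours("Call me sunny. Sunny is a cool word, sunny!", "sunny")
print(count, before, after)
```

Because the whole text is treated as one string, a neighbour may come from an adjacent tag; if that is not wanted, the same function could be called once per string returned by find_all instead.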

Ritave
  • results = soup.body.find_all(string=searched_word, recursive=true) NameError: name 'true' is not defined – Kanika Singh Oct 28 '15 at 17:35
  • I have downloaded http://www.crummy.com/software/BeautifulSoup/bs4/download/ version 4.3/ – Kanika Singh Oct 28 '15 at 17:36
  • I updated the answer with the complete, working example, please check it again – Ritave Oct 28 '15 at 17:46
  • I am getting "Found the word "sunny" 0 times". Are you using Python 2.7.3? I just copy-pasted your example code – Kanika Singh Oct 28 '15 at 18:03
  • It seems the ```string``` keyword was added in version 4.4, so use that or change ```soup.body.find_all(string=...)``` to ```soup.body.find_all(text=...)``` (different keyword for 4.3 and before) – Ritave Oct 28 '15 at 18:05
  • Ok lemme download 4.4 then and check – Kanika Singh Oct 28 '15 at 18:08
  • Thanks a lot, the issue was with the version. Now I can append the words that come before and after the required word! Thank you once again. I was so troubled, and you solved it so easily :) – Kanika Singh Oct 28 '15 at 18:30