Let's say I have an HTML file with divs like this:

<div class="message" title="user1"> <span> Hey </span> </div>
<div class="message" title="user1"> <span> It's me </span> </div>
<div class="message" title="user2"> <span> Hi </span> </div>
<div class="message" title="user3"> <span> Ola </span> </div>

How can I get a list of all users who sent messages?

If I use the `find` method I get only the first user; if I use `find_all` I get user1 twice.

Can I do this in one step, without removing duplicates from the list returned by `find_all`?

Ouroborus
Slajni
  • You can't do it "in one step without deleting duplicates". The normal procedure is to grab all matching elements and then filter those results for uniqueness. – Ouroborus Dec 16 '18 at 20:39
  • Yes, @Ouroborus is right. You'll want to make the returned list into a set. https://stackoverflow.com/a/12897419/1487413 – Martin Burch Dec 16 '18 at 20:46

2 Answers

Here are the only two ways I can think of doing it:

import bs4

r = '''<div class="message" title="user1"> <span> Hey </span> </div>
<div class="message" title="user1"> <span> It's me </span> </div>
<div class="message" title="user2"> <span> Hi </span> </div>
<div class="message" title="user3"> <span> Ola </span> </div>'''

soup = bs4.BeautifulSoup(r,'html.parser')
messages = soup.find_all('div', {'class':'message'})

users_list = []   

for user in messages:
    user_id = user.get('title')
    if user_id not in users_list:
        users_list.append(user_id)

or

import bs4

r = '''<div class="message" title="user1"> <span> Hey </span> </div>
<div class="message" title="user1"> <span> It's me </span> </div>
<div class="message" title="user2"> <span> Hi </span> </div>
<div class="message" title="user3"> <span> Ola </span> </div>'''

soup = bs4.BeautifulSoup(r,'html.parser')
messages = soup.find_all('div', {'class':'message'})

users_list = list(set([ user.get('title') for user in messages ]))
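One caveat with the `set` version is that a set does not keep the order in which users first appeared. If order matters, a possible alternative is `dict.fromkeys`, which keeps the first occurrence of each key in insertion order (Python 3.7+). This is only a sketch: the `titles` list below is a hypothetical stand-in for the values that `user.get('title')` would produce for the sample HTML.

```python
# Hypothetical stand-in for [user.get('title') for user in messages]
titles = ['user1', 'user1', 'user2', 'user3']

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the order of first appearance
users_list = list(dict.fromkeys(titles))

print(users_list)  # ['user1', 'user2', 'user3']
```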
chitown88

You could use a custom finder function:

seen_users = set()
def users(tag):
    username = tag.get('title')
    if username and 'message' in tag.get('class', ''):
        seen_users.add(username)
        return True

tags = soup.find_all(users)
print(seen_users)  # e.g. {'user1', 'user2', 'user3'} (set order may vary)
sytech
  • This is similar to the `get` method on dictionaries. It retrieves the tag's `class` attribute; if the attribute is not present, the value `''` (an empty string) is returned instead. We use this because the default `None` would cause an error: `if 'message' in None` raises a `TypeError`. – sytech Dec 16 '18 at 22:02
  • Oh, now I understand. Also, in `tags = soup.find_all(users)`, shouldn't I pass an argument? If I understand correctly, `tags` is the value holding the HTML code, right? – Slajni Dec 16 '18 at 22:04
  • When you pass a callable (a function) to `find_all` it is called for every tag in `soup`. If the function returns `True`, the tag is added to the result set. See the [beautifulsoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-function) docs for details. – sytech Dec 16 '18 at 22:06
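To make the last point concrete, here is a small self-contained sketch of passing a function to `find_all`. The name `is_message` is made up for illustration; the HTML is the sample from the question. The function is never called by hand: `find_all` calls it once per tag and returns the tags for which it returned a truthy value, so the result still contains user1 twice and uniqueness has to be handled separately.

```python
from bs4 import BeautifulSoup

html = '''<div class="message" title="user1"> <span> Hey </span> </div>
<div class="message" title="user1"> <span> It's me </span> </div>
<div class="message" title="user2"> <span> Hi </span> </div>
<div class="message" title="user3"> <span> Ola </span> </div>'''

soup = BeautifulSoup(html, 'html.parser')

def is_message(tag):
    # Called by find_all for every tag in the document (divs and spans).
    # 'class' is multi-valued in BeautifulSoup, so get() returns a list;
    # the [] default covers tags that have no class attribute.
    return tag.name == 'div' and 'message' in tag.get('class', [])

tags = soup.find_all(is_message)  # pass the function itself, no parentheses
print(len(tags))                  # 4 (both user1 divs are matched)

# Deduplicate afterwards; sorted() gives a stable, printable order
unique_users = sorted({tag['title'] for tag in tags})
print(unique_users)               # ['user1', 'user2', 'user3']
```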