Use BeautifulSoup to get profile picture without class name

Question

While learning the BeautifulSoup library and trying to crawl a web page, I found that I can narrow the search results by tag and attributes, for example an a tag with class name user-name, both of which can be found by inspecting the HTML source.

Here is a successful example:

    <a href="https://thenewboston.com/profile.php?user=2" class="user-name">
                                            Bucky Roberts </a>

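I can easily select it with:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(plain_text, 'html.parser')  # plain_text holds the page's HTML
    for link in soup.find_all('a', {'class': 'user-name'}):
        print(link.string)  # the user's display name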

However, when I try to get the profile photo's link, I see the code below by inspecting:

    <div class="panel profile-photo">
        <a href="https://thenewboston.com/profile.php?user=2">
            <img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
        </a>
    </div>

In this case the img tag holding the .jpg link has no class name to refer to. What should I do to get the .jpg link for each user?

I've not used BS for a while, but you can search for the `div` (whose class you know), then take the first child of its first child and read its `src`. – Thomas Kowalski, May 15 '17 at 20:11
Hi, thanks for your advice. I tried following it multiple times, but I am very new to this and always get errors. Can you give me a line of code that limits the search to that tag? – Psyduck, May 16 '17 at 16:56
You should rather use @alecxe's idea; it's way cleaner and more general. – Thomas Kowalski, May 16 '17 at 17:00
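
For readers who want to see the approach from the first comment spelled out, here is a minimal sketch. It assumes the markup looks exactly like the snippet in the question, and the variable names (`html`, `photo_srcs`) are made up for illustration. It uses `find()` rather than a literal "first child" lookup, because the first child of the `div` is usually a whitespace text node rather than the `a` tag.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question; a real page would be downloaded first.
html = """
<div class="panel profile-photo">
    <a href="https://thenewboston.com/profile.php?user=2">
        <img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
    </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

photo_srcs = []
# class_ matches any div whose class list contains "profile-photo".
for div in soup.find_all("div", class_="profile-photo"):
    # find() searches the div's descendants, skipping whitespace text nodes.
    img = div.find("img")
    if img is not None and img.has_attr("src"):
        photo_srcs.append(img["src"])

print(photo_srcs)
```

The `None` check keeps the loop from raising if a profile block happens to have no image.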

alecxe · Accepted Answer · 2017-05-16T17:11:46.990

2

You can use the img element's parent elements to build your locator. I would use the following CSS selector, which matches img elements directly under a elements that are directly under an element with the profile-photo class:

    soup.select(".profile-photo > a > img")

To get the src values:

    for image in soup.select(".profile-photo > a > img"):
        print(image['src'])
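
Note that the `src` values in the question's markup are relative paths. As a rough sketch, they could be turned into full links with `urllib.parse.urljoin`. The `BASE_URL` value is an assumption inferred from the profile link in the question, and `html` and `photo_urls` are illustrative names, not part of the original code.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed site root, inferred from the profile link in the question.
BASE_URL = "https://thenewboston.com/"

# Sample markup copied from the question.
html = """
<div class="panel profile-photo">
    <a href="https://thenewboston.com/profile.php?user=2">
        <img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
    </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select each profile image and resolve its relative src against the site root.
photo_urls = [
    urljoin(BASE_URL, image["src"])
    for image in soup.select(".profile-photo > a > img")
]

print(photo_urls)
```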

edited May 16 '17 at 17:11

answered May 15 '17 at 20:13


Alecxe, thank you very much for your help. What do you think about regex? I'm aware that it's slow. – innicoder May 15 '17 at 20:30
@ElvirMuslic no problem. I think you are asking about regexes in the context of using them for HTML parsing. If this is the case, it is, generally speaking, a controversial topic and it's not trivial to come up with a generic answer. There is that famous topic that raises a lot of great points: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. Thanks. – alecxe May 15 '17 at 20:32
@alecxe Thank you for that. I reviewed it for a few minutes, but I'll be sure to take an in-depth look (with notes) when I'm free. I do feel like I have to come up with a generic answer, because when I don't, I see someone else do it in a simple manner and it crushes me :c haha. The outstanding knowledge you've got is admirable; thank you again for your response. – innicoder May 16 '17 at 03:34
@alecxe, hi, I tried `soup.select(".profile-photo > a > img")` and got ValueError: Invalid CSS Selector: > img. Then I got rid of `>img` and tried `soup.select(".profile-photo > a")`. It starts to return things like `Bucky Roberts` (which is the link.string). I am wondering how to get the src link from that result, in a way similar to link.string? – Psyduck May 16 '17 at 17:04
@Noob. you should not be getting the `ValueError` - the selector looks valid. Anyway, I've updated with an example on how to get the `src` values. Take a look. Thanks. – alecxe May 16 '17 at 17:12
@alecxe Yeah, it works. I realized in my code, I deleted the space before img. `>img` gives the error while `> img` works – Psyduck May 16 '17 at 19:35
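
Since the last comment reports that the spacing around `>` mattered in the asker's setup, one way to sidestep the issue is a descendant selector, which has no combinator character at all. The sketch below assumes every profile block follows the question's markup; the second user block and its file name are invented purely for illustration.

```python
from bs4 import BeautifulSoup

# First block is from the question; the second is a made-up example user.
html = """
<div class="panel profile-photo">
    <a href="https://thenewboston.com/profile.php?user=2">
        <img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
    </a>
</div>
<div class="panel profile-photo">
    <a href="https://thenewboston.com/profile.php?user=3">
        <img src="/photos/users/3/resized/example.jpg" alt="">
    </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ".profile-photo img" matches any img nested anywhere inside a
# profile-photo element, so no ">" (and no spacing around it) is needed.
srcs = [img["src"] for img in soup.select(".profile-photo img")]

print(srcs)
```

The trade-off is that a descendant selector is looser than the child selector in the answer: it would also match an img nested more deeply inside the profile-photo element.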
