While learning the BeautifulSoup library and trying to crawl a webpage, I can narrow the search by tag and attribute, for example `a` tags with the class name `user-name`, which I find by inspecting the HTML source.

Here is a successful example:

    <a href="https://thenewboston.com/profile.php?user=2" class="user-name">
        Bucky Roberts </a>

I can easily select it with:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(plain_text, 'html.parser')
    for link in soup.find_all('a', {'class': 'user-name'}):
        print(link.string)  # text inside the matched <a> tag

However, when I try to get the profile photo's link, I see the code below by inspecting:

    <div class="panel profile-photo">
        <a href="https://thenewboston.com/profile.php?user=2">
            <img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
        </a>
    </div>

In this case the `img` tag has no class or other attribute of its own to select it by. What should I do to get the .jpg link for each user?

Psyduck
  • I've not used BS for a while, but you can search for the `div` (whose class you know), then take the first child of its first child and get its `src`. – Thomas Kowalski May 15 '17 at 20:11
  • Hi man, thanks for your advice. I tried following it multiple times, but I am very new to this and always get errors. Can you give me a line of code limiting the tag? – Psyduck May 16 '17 at 16:56
  • You should rather use @alecxe's idea; it's way cleaner and more general. – Thomas Kowalski May 16 '17 at 17:00

1 Answer

You can use the `img` element's parent elements to build your locator. I would use the following CSS selector, which matches `img` elements directly under `a` elements that are directly under the element with the `profile-photo` class:

    soup.select(".profile-photo > a > img")

To get the src values:

    for image in soup.select(".profile-photo > a > img"):
        print(image['src'])
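
For a self-contained check, here is a minimal sketch that runs the same selector against the markup from the question. The `src` values on that page are relative, so the sketch also joins them to a base URL with `urljoin`; `BASE_URL`, `html`, and `photo_urls` are names assumed here for illustration, and the base is guessed from the profile link in the question.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: the site root, inferred from the profile link in the question.
BASE_URL = "https://thenewboston.com/"

# Sample markup copied from the question.
html = """
<div class="panel profile-photo">
    <a href="https://thenewboston.com/profile.php?user=2">
        <img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
    </a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Match every img that is a direct child of an a that is a direct child
# of an element with the profile-photo class, then make each src absolute.
photo_urls = [
    urljoin(BASE_URL, image["src"])
    for image in soup.select(".profile-photo > a > img")
]

for url in photo_urls:
    print(url)
```

On a real page you would build `soup` from the downloaded page text instead of the inline `html` string.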
alecxe
  • Alecxe, thank you very much for your service. What do you think about regex? I'm aware that it's slow. – innicoder May 15 '17 at 20:30
  • @ElvirMuslic no problem. I think you are asking about regexes in the context of using them for HTML parsing. If this is the case, it is, generally speaking, a controversial topic and it's not trivial to come up with a generic answer. There is that famous topic that raises a lot of great points: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454. Thanks. – alecxe May 15 '17 at 20:32
  • @alecxe Thank you for that. I reviewed it for a few minutes, but I'll be sure to take an in-depth look (with notes) when I'm free. I do feel like I have to come up with a generic answer, because when I don't, I see someone else do it in a simple manner and it crushes me :c haha. The outstanding knowledge you've got is admirable; thank you again for your response. – innicoder May 16 '17 at 03:34
  • @alecxe, hi man, I tried `soup.select(".profile-photo > a > img")` and got ValueError: Invalid CSS Selector: > img. Then I got rid of `>img` and tried `soup.select(".profile-photo > a")`. It starts to return things like `Bucky Roberts` (which is the `link.string`). I am wondering how to get the `src` link in a way similar to `link.string`? – Psyduck May 16 '17 at 17:04
  • @Noob. you should not be getting the `ValueError` - the selector looks valid. Anyway, I've updated with an example on how to get the `src` values. Take a look. Thanks. – alecxe May 16 '17 at 17:12
  • @alecxe Yeah, it works. I realized in my code, I deleted the space before img. `>img` gives the error while `> img` works – Psyduck May 16 '17 at 19:35