
I am trying to get the link (href) off a website. The element is:

<a class="overlay no-outline" href="/photos/28716729@N06/2834595694/" tabindex="0" role="heading" aria-level="3" aria-label="puppy by mpappas83" data-rapid_p="61" id="yui_3_16_0_1_1477971884605_5513"></a>

First I am trying to match the class "overlay no-outline". But notice that it contains a space, so the select() method treats it as two different selectors instead of one.

imgElem = soup.select('.overlay no-outline')  # attempt
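To illustrate the failure (a minimal standalone demo, not part of the original question): with a space, select() reads this as a descendant selector, i.e. it looks for <no-outline> tags nested inside an element with class "overlay", so it matches nothing:

import bs4

demo = '<a class="overlay no-outline" href="/photos/x/"></a>'
soup = bs4.BeautifulSoup(demo, "html.parser")
# Treated as a descendant selector, so this prints an empty list.
print(soup.select('.overlay no-outline'))   # []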

Does anyone know how I would be able to achieve this?

The website is at www.flickr.com

tadm123
  • This seems to be what you're looking for: http://stackoverflow.com/questions/34433544/include-multiple-class-names-in-findall-in-beautifulsoup4 – David Edwards Nov 01 '16 at 04:40

2 Answers


The following approach should help:

import bs4

html = """<a class="overlay no-outline" href="/photos/28716729@N06/2834595694/" tabindex="0" role="heading" aria-level="3" aria-label="puppy by mpappas83" data-rapid_p="61" id="yui_3_16_0_1_1477971884605_5513"></a>"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Chain the classes with '.' so both must apply to the same tag.
for link in soup.select("a.overlay.no-outline"):
    print(link['href'])

Which displays:

/photos/28716729@N06/2834595694/        

The space in between signals that two different classes are being applied. The BeautifulSoup documentation has a section on how to address this using the above method; look for the text "If you want to search for tags that match two or more CSS classes".
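As an aside (these alternatives are not from the original answer, just other BeautifulSoup options): find_all() with class_ matches any tag whose class list contains the given class, and a CSS attribute selector matches the literal class string:

# Matches tags whose class list *contains* "overlay".
links = soup.find_all("a", class_="overlay")

# Exact-string match; only works if the class attribute is
# literally 'overlay no-outline' in that order.
links = soup.select('a[class="overlay no-outline"]')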

Martin Evans
import fnmatch

import requests
from lxml import html

class HtmlRat:

    def req_page(self, url):
        # Fetch the raw page.
        return requests.get(url)

    def tag_data(self, url, txpath):
        # Parse the page and join the text nodes under the given XPath.
        tree = html.fromstring(self.req_page(url).content)
        tag_val = tree.xpath(txpath + "/text()")
        val = ''.join(tag_val).strip(' ')
        return val.split(' ')

def link_grabber(url, pattern):
    markup = HtmlRat()
    page = markup.req_page(url)
    matches = []
    # Scan the raw HTML token by token for anything matching the pattern.
    for token in page.text.split():
        if fnmatch.fnmatch(token, pattern):
            print(token)
            matches.append(token)
    return matches

flickr = link_grabber("https://www.flickr.com/search/?text=cars", 'href="*"')
superstreet = link_grabber("http://www.superstreetonline.com/features/1610-2013-scion-fr-s-multipurposed/", 'href="*.jpg"')

# From here you can split each match on '=' to get the link itself.
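For instance (a hypothetical follow-up using an example token shaped like the matches printed above, not part of the original answer):

# Example token in the href="..." form matched by the pattern.
token = 'href="/photos/28716729@N06/2834595694/"'
# Split once on '=' and strip the surrounding quotes.
link = token.split('=', 1)[1].strip('"')
print(link)   # /photos/28716729@N06/2834595694/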

This should work. But when we read the source of the Flickr page, the links aren't there; they are pretty clearly generated on the back end. Try the code on Pexels or some other sites and you should be good.

Jasmit Tarang