
I have a set of static HTML files that I need to parse to extract some details. I'm using Python's lxml module to do this. A sample from one of the static HTML files is shown below:

<div class="top">
<a data-bind="text">abc</a>
<span data-bind="visible:hotel.marca1!='' &amp;&amp; hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
    </span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
    </span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
    <span></span>
    <span class="locality" data-bind="text: hotel.pob"></span>
</div>
</div>

<div class="top">
<a data-bind="text">dfg</a>
<span data-bind="visible:hotel.marca1!='' &amp;&amp; hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
    </span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
    </span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'" style="display: none;"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
    <span></span>
    <span class="locality" data-bind="text: hotel.pob"></span>
</div>

Here's the problem: I need to get the star rating from the visible span with class 'star'. For example, in the first div[@class="top"] the visible span has a star rating of '4', while the second div[@class="top"] has no visible span with class 'star', so it should return a star rating of '0'. However, since these elements are hidden, I'm having trouble fetching them and getting the script to return '0' for a div whose 'star' spans are all hidden.

This is what I have tried so far:

tree = html.fromstring(page)
for sali in tree.xpath('//div[@class="top"]'):
    for x in sali.xpath('a'):
        for sal in sali.xpath('span[not(contains(@style,"display:none"))]'): 
            print x , sal.attrib['data-bind']

But this code doesn't give me the result I want. What am I doing wrong?

Expected output:

abc 4
dfg 0

radix

2 Answers


There are a few ways to approach the problem. Here is one: get the "star" rating elements and return the index of the first "visible" one, falling back to 0 if none is found. We can use next() and enumerate() to achieve that:

from lxml.html import fromstring


def is_visible(element):
    """Naive visibility check: looks only for an inline 'display: none;' style."""
    return 'display: none;' not in element.attrib.get("style", "")


def get_rating(entry):
    rating_elements = entry.xpath(".//span[contains(@class, 'star')]")
    # 1-based positions of the visible star spans; star4 sits at position 4.
    visible_rating = (index
                      for index, element in enumerate(rating_elements, start=1)
                      if is_visible(element))
    # First visible position, or 0 when every star span is hidden.
    return next(visible_rating, 0)


root = fromstring(html)  # html: the page source as a string
for sali in root.xpath('//div[@class="top"]'):
    for x in sali.xpath('a'):
        print(x.text, get_rating(sali))

Prints (under Python 2, where print shows the arguments as a tuple):

('abc', 4)
('dfg', 0)
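
The visibility check above depends on the exact text 'display: none;', so a style written as "display:none" would slip through. A slightly more tolerant sketch follows; the HIDDEN_RE name and the sample snippet are made up for illustration, and it still only looks at inline styles, not at stylesheets or scripts:

```python
import re
from lxml.html import fromstring

# Hypothetical helper pattern: "display:none" with any spacing or letter case.
HIDDEN_RE = re.compile(r"display\s*:\s*none", re.IGNORECASE)


def is_visible(element):
    """Return False when the element's inline style hides it."""
    return HIDDEN_RE.search(element.get("style", "")) is None


# Illustrative markup, not taken from the real pages.
snippet = """
<div class="top">
  <span class="star star1" style="display:none"></span>
  <span class="star star2" style="DISPLAY : none;"></span>
  <span class="star star3"></span>
</div>
"""
root = fromstring(snippet)
stars = root.xpath(".//span[contains(@class, 'star')]")
visibility = [is_visible(star) for star in stars]
print(visibility)
```

Only the third span has no hiding style, so it is the only one reported as visible.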

Beware that class is a multi-valued attribute and, strictly speaking, contains() is not the best tool for finding an element by a class value: contains(@class, 'star') would also match an element whose class is, say, 'starred'.
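
One common XPath 1.0 idiom for matching a whole class token is to pad the normalized class value with spaces and search for the padded name. A minimal sketch, using made-up markup (the 'starred' span is hypothetical and only there to show the difference):

```python
from lxml.html import fromstring

# Illustrative markup; only the first span really has the class "star".
snippet = """
<div>
  <span class="star sprite-disponibilidad star4"></span>
  <span class="starred"></span>
  <span class="marca"></span>
</div>
"""
root = fromstring(snippet)

# Substring match: also picks up "starred".
loose = root.xpath(".//span[contains(@class, 'star')]")

# Token match: " star " must appear in the space-padded class list.
exact = root.xpath(
    ".//span[contains(concat(' ', normalize-space(@class), ' '), ' star ')]"
)

loose_classes = [span.get("class") for span in loose]
exact_classes = [span.get("class") for span in exact]
print(loose_classes)
print(exact_classes)
```

The substring version returns two spans, the token version only the one whose class list actually contains "star".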

alecxe
  • Thanks. Instead of 'contains', what if I use tree.cssselect(star.sprite.disponibilidad)? Is that a better way? – radix Dec 06 '18 at 03:02
  • @JustinJoy Yeah, definitely a better option, go for it. – alecxe Dec 06 '18 at 03:02
  • Also, by creating two functions, am I affecting the overall runtime of the script? – radix Dec 06 '18 at 03:03
  • @JustinJoy In general, function calls have a cost, but you only need to worry about them if you have a very high number of calls, you've already worked through all the other bottlenecks (e.g. HTML parsing or tree traversals in XPaths in this case), and the execution time is critically important :) – alecxe Dec 06 '18 at 03:08
  • Thanks, will try out the solution and get back here. – radix Dec 06 '18 at 03:18

You could use lxml via BeautifulSoup. Someone more familiar with Python can probably tidy this up.

from bs4 import BeautifulSoup

html = '''
<div class="top">
<a data-bind="text">abc</a>
<span data-bind="visible:hotel.marca1!='' &amp;&amp; hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
    </span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
    </span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
    <span></span>
    <span class="locality" data-bind="text: hotel.pob"></span>
</div>
</div>

<div class="top">
<a data-bind="text">dfg</a>
<span data-bind="visible:hotel.marca1!='' &amp;&amp; hotel.marca1!='logo_ha', attr:{title:hotel.textoMarca1}" title="Hotusa" style="display: none;">
    </span>
<span class="marca" data-bind="visible:hotel.marca1==='' || hotel.marca1==='logo_ha'">
    </span>
<span class="star sprite-disponibilidad star1" data-bind="visible:hotel.cat === '1'" style="display: none;"></span>
<span class="star sprite-disponibilidad star2" data-bind="visible:hotel.cat === '2'" style="display: none;"></span>
<span class="star sprite-disponibilidad star3" data-bind="visible:hotel.cat === '3'" style="display: none;"></span>
<span class="star sprite-disponibilidad star4" data-bind="visible:hotel.cat === '4'" style="display: none;"></span>
<span class="star sprite-disponibilidad star5" data-bind="visible:hotel.cat === '5'" style="display: none;"></span>
<div class="adr">
    <span></span>
    <span class="locality" data-bind="text: hotel.pob"></span>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
ratings = []
for item in soup.select("div.top"):
    hotel = item.select_one('a').text
    rating = '0'  # default when every star span is hidden
    for star in item.select("[data-bind*='visible:hotel.cat']"):
        # A star span without a style attribute is the visible one.
        if 'style' not in star.attrs:
            # data-bind looks like "visible:hotel.cat === '4'"; take the quoted value.
            rating = star['data-bind'].split("'")[1]
            break
    ratings.append([hotel + ' ' + rating])
print(ratings)

Output:

[['abc 4'], ['dfg 0']]
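
The rating can also be pulled out of the data-bind value with a regular expression, which avoids leaning on str.strip (it removes any of the given characters from both ends rather than a literal prefix). A small sketch; CAT_RE and rating_from_binding are names invented here, and the pattern assumes the attribute always looks like the sample markup:

```python
import re

# Assumed attribute shape, taken from the sample: visible:hotel.cat === '4'
CAT_RE = re.compile(r"hotel\.cat\s*===\s*'(\d+)'")


def rating_from_binding(data_bind):
    """Return the rating quoted in a data-bind value, or None if absent."""
    match = CAT_RE.search(data_bind)
    return match.group(1) if match else None


star_rating = rating_from_binding("visible:hotel.cat === '4'")
no_rating = rating_from_binding("visible:hotel.marca1===''")
print(star_rating, no_rating)

# str.strip treats its argument as a set of characters, not a prefix:
stripped = "catalog".strip("cat")
print(stripped)
```

The last two lines show why strip can surprise: every leading 'c', 'a' or 't' is removed, not just the prefix "cat".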

QHarr