3

I'm trying to learn how to find and parse data from HTML5 web pages so I can store it in a database. I only want the data from the first element matching '//div[@class="col-xs-12 col-sm-6 col-md-4 col-lg-3"]'.

I've tried html5lib, lxml (from lxml import html) and XPath, but the lack of documentation for my specific use case is frustrating, and I couldn't find out how to achieve this.

Data to find and store:

http://csgo.steamanalyst.com/id/120565/ 
from <span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/'

And the 2 numbers from "addToCart(1852864,1108)" as id1:'1852864' and id2:'1108'

in <button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'

The HTML code I'm trying to learn from:

<!DOCTYPE html> 

<div class='row'><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852864'>StatTrak&#8482; Desert Eagle | Conspiracy (Factory New)</a><br /><small class='text-muted'>StatTrak&#8482; Classified Pistol</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>1,108</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120565/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>1,451</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&StatTrak=1&search_item=+Desert+Eagle+%7C+Conspiracy+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1841001'>★ Karambit | Doppler (Factory New)</a><br /><small class='text-muted'>★ Covert Knife</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>155,000</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/62403692/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>30,300</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=%E2%98%85+Karambit+%7C+Doppler+%28Factory+New%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem2' onclick='addToCart(1841001,155000)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852853'>AK-47 | Redline (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>441</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/1420/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>520</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=AK-47+%7C+Redline+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem3' onclick='addToCart(1852853,441)'>Add to cart</button></center></div>
    </div>
  </div></div><!-- /.col-md-4 --><div class='col-xs-12 col-sm-6 col-md-4 col-lg-3'><div class='featured-item'><a class='market-name market-link' href='https://opskins.com/index.php?loc=shop_view_item&item=1852846'>M4A1-S | Master Piece (Field-Tested)</a><br /><small class='text-muted'>Classified Rifle</small><img style='margin-top:-25px;' src='256fx256f' />
    <div class='item-add'>
      <div class='item-amount'><span class='icon-logo'></span>6,618</div>
      <div class='market-name' style='padding-bottom:0.3em;'><span class='market-name'><a style='color:white;' href='http://csgo.steamanalyst.com/id/120409/' target='_BLANK'>Suggested Price: <span class='icon-logo'></span>8,905</a></div>
                <div class='item-buttons'><center> class='btn btn-primary' style='margin-right:4px'>Inspect</a><a href ='/?loc=shop_search&sort=lh&search_item=M4A1-S+%7C+Master+Piece+%28Field-Tested%29' class='btn btn-primary'>Search</a>
                    <br /><button class='btn btn-orange' type='button' style='font-size:1.2em;margin-top:2px;' id='shopItem4' onclick='addToCart(1852846,6618)'>Add to cart</button></center></div>
    </div>
Marie Anne

2 Answers

1

Use the HTML parser in the lxml library. In the working example below, your HTML is assigned to myhtml. There may be a more elegant way to parse the text from the button's onclick attribute, but this is a start.

>>> from lxml import html
>>> tree = html.fromstring(myhtml)
>>> mybuttons = tree.xpath('//button[@class="btn btn-orange" and @onclick]')
>>> len(mybuttons)
4
>>> for button in mybuttons:
...     (id1, id2) = button.attrib['onclick'].replace('(', ' ').replace(',', ' ').replace(')', ' ').split()[1:]
...     print(id1, id2)
... 
1852864 1108
1841001 155000
1852853 441
1852846 6618
>>> myurl = tree.xpath('//span[@class="market-name"]/a')
>>> for u in myurl:
...     href = u.attrib['href']
...     print(href)
... 
http://csgo.steamanalyst.com/id/120565/
http://csgo.steamanalyst.com/id/62403692/
http://csgo.steamanalyst.com/id/1420/
http://csgo.steamanalyst.com/id/120409/
>>> 
Thane Plummer
  • This is what I'm looking for, thank you! Although for the button attribute, it returns a KeyError `File "lxml.etree.pyx", line 2295, in lxml.etree._Attrib.__getitem__ (src/lxml/lxml.etree.c:59791) KeyError: 'onclick'` – Marie Anne Aug 04 '15 at 03:06
  • @MarieAnne If you are reading from a file, for example if your HTML is in a file called `myhtml.htm`, you will need to change the tree reader line from `tree = html.fromstring(myhtml)` to `tree = html.parse('myhtml.htm')`. The posted answer parses the data as a string, but it works just as well if you parse from a file as shown in this comment. – Thane Plummer Aug 04 '15 at 03:43
  • @MarieAnne I edited the code above to work with the URL you provided by changing the selector to require the `onclick` attribute. You may want to delete all the scripts to make it easier to parse. – Thane Plummer Aug 04 '15 at 04:22
  • This is exactly what I was looking for, thank you. Just one more question, please: is it possible to parse these strings as linked data, so that each href is stored together with its id1 and id2, instead of ending up with two completely separate lists? – Marie Anne Aug 04 '15 at 15:57
  • Yes, you should first get the buttons and urls from the xpath query, and then merge them using the `zip` function. See https://docs.python.org/2/library/functions.html#zip. In this case it would look something like this: `for (button, u) in zip(mybuttons, myurl): # Operate on button and u here...` – Thane Plummer Aug 04 '15 at 19:21
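
The pairing suggested in the last comment can be sketched with the standard library alone. The two lists below are hard-coded stand-ins for the `href` and `onclick` values that the XPath queries in the answer return (the values are copied from the question's HTML); the names `hrefs`, `onclicks` and `records`, and the dict layout, are assumptions made for illustration.

```python
import re

# Stand-ins for [u.attrib['href'] for u in myurl] and
# [b.attrib['onclick'] for b in mybuttons]; values copied from the question's HTML.
hrefs = [
    'http://csgo.steamanalyst.com/id/120565/',
    'http://csgo.steamanalyst.com/id/62403692/',
]
onclicks = [
    'addToCart(1852864,1108)',
    'addToCart(1841001,155000)',
]

records = []
for href, onclick in zip(hrefs, onclicks):
    # Pull the two numeric arguments out of the onclick string.
    id1, id2 = re.findall(r'\d+', onclick)
    records.append({'href': href, 'id1': id1, 'id2': id2})

first = records[0]  # the question only needs the first item
print(first)
```

Note that `zip` pairs purely by position, so this only links the right values if every item on the page has both a suggested-price link and an "Add to cart" button.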
0

I have used a simpler library for a similar problem:

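import re
from html.parser import HTMLParser  # Python 2: from HTMLParser import HTMLParser

# Compiled once; captures the two numbers in addToCart(id1,id2).
add_to_cart_RE = re.compile(r'addToCart\((\d+),(\d+)\)')

class MyParser(HTMLParser):
  def __init__(self):
    HTMLParser.__init__(self)
    self.in_market = 0   # nesting depth inside <span class='market-name'>
    self.markets = {}    # href -> [id1, id2]
    self.market = None   # href of the current market link

  def handle_starttag(self, tag, attrs):
    attrs = dict(attrs)  # HTMLParser passes a list of (name, value) pairs
    if tag == 'span':
      if 'market-name' in (attrs.get('class') or ''):
        self.in_market = 1
        self.market = None
      elif self.in_market:
        self.in_market += 1
    elif self.in_market:
      if tag == 'a' and 'href' in attrs and self.market is None:
        # Keep only the first link after the span; later links (e.g. "Search") are ignored.
        self.market = attrs['href']
      elif tag == 'button' and 'onclick' in attrs:
        match = add_to_cart_RE.match(attrs['onclick'])
        if match and self.market:
          self.markets[self.market] = [match.group(1), match.group(2)]

  def handle_endtag(self, tag):
    if tag == 'span' and self.in_market:
      self.in_market -= 1

  def handle_data(self, data):
    pass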

Ask me questions if the code is unclear to you.
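
Since the question asks for only the first item, here is a minimal standard-library sketch along the same lines that stops recording once the first link and button have been seen. The class name `FirstItemParser` and the `SAMPLE` string (a simplified, well-formed excerpt modelled on the question's HTML) are assumptions for illustration, not part of the answer above; the import is the Python 3 location of `HTMLParser`.

```python
import re
from html.parser import HTMLParser

class FirstItemParser(HTMLParser):
    """Record the first market link and the first addToCart() arguments."""

    def __init__(self):
        super().__init__()
        self.seen_span = False   # True once <span class='market-name'> has opened
        self.href = None
        self.ids = None

    def handle_starttag(self, tag, attrs):
        if self.ids is not None:
            return               # first item already captured; ignore the rest
        attrs = dict(attrs)      # attrs arrives as a list of (name, value) pairs
        if tag == 'span' and attrs.get('class') == 'market-name':
            self.seen_span = True
        elif tag == 'a' and self.seen_span and self.href is None:
            self.href = attrs.get('href')
        elif tag == 'button' and self.href is not None:
            match = re.match(r'addToCart\((\d+),(\d+)\)', attrs.get('onclick') or '')
            if match:
                self.ids = match.groups()

# Simplified excerpt with two items; only the first should be recorded.
SAMPLE = (
    "<div class='market-name'><span class='market-name'>"
    "<a href='http://csgo.steamanalyst.com/id/120565/'>Suggested Price</a></span></div>"
    "<a href='/?loc=shop_search' class='btn btn-primary'>Search</a>"
    "<button id='shopItem1' onclick='addToCart(1852864,1108)'>Add to cart</button>"
    "<span class='market-name'><a href='http://csgo.steamanalyst.com/id/62403692/'>x</a></span>"
    "<button id='shopItem2' onclick='addToCart(1841001,155000)'>Add to cart</button>"
)

parser = FirstItemParser()
parser.feed(SAMPLE)
print(parser.href, parser.ids)
```

The parser still reads the whole document; it simply ignores every tag after the first button has matched, which keeps the code short at the cost of some wasted work on large pages.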

Paul Marrington
  • Isn't regex bad at parsing HTML? http://stackoverflow.com/a/1732454/4570549 I'm going to try it and get back to you, but it seems to have a lot of conditions. Doesn't that hinder performance as well? – Marie Anne Aug 04 '15 at 01:21
  • The regex was only to pull the two numbers from the onclick event. If the format is well fixed you could process it with more basic means. I should have said '^addToCart...\)$' for the most efficient regex. Then it would probably be more efficient than manual manipulation. It certainly would be in V8 - not so sure for Python. – Paul Marrington Aug 04 '15 at 02:02
  • I'm going to test regex and lxml see which works best, thank you – Marie Anne Aug 04 '15 at 03:04
  • Update, I chose to go with the lxml version for the simplicity of the code, but thank you again for this method, because of this I learnt more about regex. – Marie Anne Aug 04 '15 at 16:01
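
The anchored pattern mentioned in the comments could look like the sketch below. The helper name `parse_add_to_cart` is hypothetical, and the pattern assumes the attribute value is always exactly `addToCart(<digits>,<digits>)`, as it is in the question's HTML.

```python
import re

# Anchored at both ends, so anything other than addToCart(<digits>,<digits>) is rejected.
ADD_TO_CART_RE = re.compile(r'^addToCart\((\d+),(\d+)\)$')

def parse_add_to_cart(onclick):
    """Return (id1, id2) as strings, or None if the value has another shape."""
    match = ADD_TO_CART_RE.match(onclick)
    return match.groups() if match else None

print(parse_add_to_cart('addToCart(1852864,1108)'))
print(parse_add_to_cart('removeFromCart(1852864)'))
```

Returning `None` for unexpected values lets the caller skip buttons with a different handler instead of raising an exception.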