I'm pretty new to this, and I haven't been able to find anything on Google about this question.

I'm using requests and lxml with Python. I've seen that there are a lot of different modules for web scraping, but is there any reason to choose one over another? Can you do the same things with requests/lxml as you can with, for example, BeautifulSoup?

Anyway, here's my actual question:

This is my code:

import requests
from lxml import html

# Login data
inputUrl = 'http://forum.mytestsite.com/login'
usr = 'myusername'
pwd = 'mypassword'
payload = dict(login=usr, password=pwd)

# Open session
with requests.Session() as s:
    # Login
    s.post(inputUrl, data=payload)

    # Get page data
    pageResult = s.get('http://forum.mytestsite.com/icons/', allow_redirects=False)
    pageResult = html.fromstring(pageResult.content)
    pageIcons = pageResult.xpath('//script[@id="table-icons"]/text()')
    print(pageIcons[0])

The result when printing pageIcons[0]:

<ul id="icons">
{{#each icons}}
   <li data-handle="{{handle}}">
     <img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
   </li>
{{/each}}
</ul>


This is the website/js code that generates the icons:

<script id="table-icons" type="text/x-handlebars-template">
  <ul id="icons">
    {{#each icons}}
       <li data-handle="{{handle}}">
         <img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
       </li>
    {{/each}}
  </ul>
</script>

And here's the result on the page:

<ul id="icons">
    <li data-handle="558FSTBI" class="">
        <img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
    </li>
    <li data-handle="310AYTZI">
        <img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
    </li>
    <li data-handle="669PQXBI" class="">
        <img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
    </li>
</ul>



My goal:
What I would like to do is retrieve all of the li data-handle values, but I haven't been able to figure out how to get at this data. Ultimately, I want to retrieve all of the icon paths and their titles. Could anyone help me out here? I'd really appreciate any help :)

Lorena
  • `//script` is not part of the rendered HTML. Why are you trying to parse the template code? – OneCricketeer Jun 18 '17 at 13:34
  • Well, because I'm a noob :P I thought that since the result of the script gives me the ul/li handles I actually want, it was logical to do it that way. I mean, the rendered HTML is generated from the script, right? How else can I get the links? – Lorena Jun 18 '17 at 14:09
  • You can't render the template code with Python requests. Plus, if the list is rendered after the page loads, you get an empty list, so you can't use requests anyway. https://stackoverflow.com/questions/13960567/reading-dynamically-generated-web-pages-using-python – OneCricketeer Jun 18 '17 at 14:12

1 Answer

You aren't parsing the li or ul.

Start with this

//ul[@id='icons']/li/img

From those img elements, you can extract the individual attributes (src, alt, title); the data-handle sits on each parent li.
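As a minimal sketch of that XPath in use, the snippet below runs it against the rendered markup quoted in the question. It assumes you already have the rendered HTML as a string; the names `rendered`, `extract_icons` and `icons` are made up for illustration.

```python
from lxml import html

# Rendered markup copied from the question (what the browser shows
# after the JavaScript template has been filled in).
rendered = """
<ul id="icons">
    <li data-handle="558FSTBI" class="">
        <img src="http://testsite.com/icons/558FSTBI.1.png" alt="Icon 1" title="Icon 1">
    </li>
    <li data-handle="310AYTZI">
        <img src="http://testsite.com/icons/310AYTZI.1.png" alt="Icon 2" title="Icon 2">
    </li>
    <li data-handle="669PQXBI" class="">
        <img src="http://testsite.com/icons/669PQXBI.1.png" alt="Icon 3" title="Icon 3">
    </li>
</ul>
"""


def extract_icons(markup):
    """Return one dict per icon with its handle, image path and title."""
    tree = html.fromstring(markup)
    icons = []
    for img in tree.xpath("//ul[@id='icons']/li/img"):
        li = img.getparent()  # the data-handle attribute lives on the <li>
        icons.append({
            "handle": li.get("data-handle"),
            "src": img.get("src"),
            "title": img.get("title"),
        })
    return icons


icons = extract_icons(rendered)
for icon in icons:
    print(icon["handle"], icon["src"], icon["title"])
```

This only works on markup that already contains the li elements, which is not what requests receives from this particular page.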

Regarding the first question, BeautifulSoup can optionally use lxml as its parser. If you don't think you need it and are comfortable with XPath, don't worry about it.

However, since JavaScript generates the list on this page, you need a headless browser rather than the requests library.

Get page generated with Javascript in Python

Reading dynamically generated web pages using python
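To see why requests alone comes back empty here, the rough sketch below parses a page shaped like the one in the question, as the server would send it before any JavaScript runs. The `raw_page` string is a trimmed stand-in built from the template in the question, not the real site's full HTML, and the variable names are hypothetical.

```python
from lxml import html

# Stand-in for the raw response body: the list exists only as template
# text inside a <script> tag, exactly as quoted in the question.
raw_page = """
<html>
  <body>
    <script id="table-icons" type="text/x-handlebars-template">
      <ul id="icons">
        {{#each icons}}
           <li data-handle="{{handle}}">
             <img src="{{image_path}}" alt="{{desc_or_name this}}" title="{{desc_or_name this}}">
           </li>
        {{/each}}
      </ul>
    </script>
  </body>
</html>
"""

tree = html.fromstring(raw_page)

# The parser treats everything inside <script> as plain text, so no
# <ul>, <li> or <img> elements exist in the tree yet.
img_elements = tree.xpath("//ul[@id='icons']/li/img")
print(img_elements)

# The only thing available is the unrendered template source.
template_text = tree.xpath('//script[@id="table-icons"]/text()')[0]
print("{{#each icons}}" in template_text)
```

The placeholders are only filled in once a browser executes the page's JavaScript, which is why a headless browser (or a browser-automation tool such as Selenium) is needed to produce markup the XPath can match.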

OneCricketeer
  • That's what's weird, if I try that I just get an empty list back. It doesn't look like it's possible to get the content of the links :/ How can I go about debugging this? – Lorena Jun 18 '17 at 14:05
  • I used an online XPath tool, and it worked fine after I closed the `` – OneCricketeer Jun 18 '17 at 14:09
  • Thanks for your help. Too bad it isn't possible to get JS generated pages without having to emulate a browser :/ – Lorena Jun 18 '17 at 14:50