Python: get the contents of a web page that uses JavaScript (maybe with Selenium)

Question

I need to analyse the contents of a web page that relies on JavaScript. Can you advise on a better way than using Selenium?

If not: when loaded in a browser, the page has these elements:

<div class="js-container">    <table class="zebra" style="width: 100%;">
        <tbody><tr>
            <th>A</th>
            <th>B</th>
            <th>C</th>
        </tr>
            <tr>
                <td>A1</td>
                <td>A2</td>
                <td>
                    <a href="http://X" style="color: black">T1</a>
                </td>
            </tr>
            <tr>
                ....
            </tr>
....

I need to read the table element by element. For example, I run:

myList = myDriver.find_elements_by_class_name("js-container")

How do I then get the inner elements of the "js-container" object?

myList contains only one element; print(myList[0]) gives:

<selenium.webdriver.remote.webelement.WebElement (session="61238", element="{71293}")>

score 2 · Accepted Answer · edited Jul 16 '16 at 14:45

Maybe you need BeautifulSoup: feed it Selenium's driver.page_source. It is a Python library that builds a parse tree from the page's HTML. See the BeautifulSoup documentation.
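A minimal sketch of that idea, assuming the `bs4` package is installed and that the rendered HTML looks like the snippet in the question. The helper name `table_rows` and the `sample` string are illustrative only; with a live session you would pass `myDriver.page_source` instead of `sample`.

```python
from bs4 import BeautifulSoup


def table_rows(page_source):
    """Return the table inside div.js-container as a list of rows of cell text."""
    soup = BeautifulSoup(page_source, "html.parser")
    table = soup.select_one("div.js-container table.zebra")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells (th) and data cells (td) are treated alike here.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


# Stand-in for myDriver.page_source, modelled on the HTML in the question.
sample = """
<div class="js-container"><table class="zebra" style="width: 100%;">
<tbody><tr><th>A</th><th>B</th><th>C</th></tr>
<tr><td>A1</td><td>A2</td>
<td><a href="http://X" style="color: black">T1</a></td></tr>
</tbody></table></div>
"""

for row in table_rows(sample):
    print(row)
```

Because `page_source` is read after the browser has run the page's scripts, the parsed tree should include the JavaScript-generated table, provided the script has finished by the time it is read.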

edited Jul 16 '16 at 14:45

Alex Martian

answered Jul 16 '16 at 11:33

Ben Lee

Do you mean that you want to fetch a page that changes as it loads, and you need the resulting page? – Ben Lee Jul 16 '16 at 12:01
When I tried Selenium, I found this answer: http://stackoverflow.com/a/30103931/5359105. Try `browser.page_source` to get the page and pass it to BS. – Ben Lee Jul 16 '16 at 12:21
@Ben Lee, looks like browser.page_source does it. Thank you. I wonder why it's not documented as one of the main features of Selenium. – Alex Martian Jul 16 '16 at 13:31

score 2 · Answer 2 · answered Jul 16 '16 at 13:23

Selenium can do this just fine.

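# In Selenium 4, use myDriver.find_elements(By.CSS_SELECTOR, "table.zebra *") instead
tableDescendants = myDriver.find_elements_by_css_selector("table.zebra *")
for tableDescendant in tableDescendants:
    outer = tableDescendant.get_attribute("outerHTML")
    inner = tableDescendant.get_attribute("innerHTML")
    # Keep only the part of outerHTML that precedes innerHTML, i.e. the opening tag
    print(outer[:outer.find(inner)])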

This code grabs all descendants of the TABLE tag and, for each one, prints the part of outerHTML that comes before the innerHTML string. outerHTML contains the element itself and all its descendants, while innerHTML contains only the descendants. So, to get only the HTML of the element itself (its opening tag), we cut outerHTML off where innerHTML begins.
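The slicing step can be tried on plain strings, without a browser. The sketch below is not part of the answer: `opening_tag` is a hypothetical helper that assumes outerHTML is laid out as opening tag, then innerHTML, then an optional closing tag. It trims from the end rather than searching from the start, which sidesteps two cases where `outer.find(inner)` gives a surprising result: an empty innerHTML (where `find` returns 0) and inner text that also appears in an attribute.

```python
def opening_tag(outer_html, inner_html, tag_name):
    """Return outer_html without its inner HTML and closing tag.

    Hypothetical helper: assumes outer_html is laid out as
    opening tag + inner_html + optional closing tag.
    """
    closing = "</" + tag_name.lower() + ">"
    if outer_html.lower().endswith(closing):
        body_end = len(outer_html) - len(closing)
    else:
        body_end = len(outer_html)  # no closing tag, e.g. <br>
    if inner_html and outer_html[:body_end].endswith(inner_html):
        return outer_html[:body_end - len(inner_html)]
    return outer_html[:body_end]


print(opening_tag("<td>A1</td>", "A1", "td"))           # <td>
print(opening_tag("<td></td>", "", "td"))               # <td>

# Inner text that also occurs inside an attribute value:
outer = '<a title="T1">T1</a>'
print(outer[:outer.find("T1")])                         # <a title="
print(opening_tag(outer, "T1", "a"))                    # <a title="T1">
```

With a live element, the three arguments would come from `get_attribute("outerHTML")`, `get_attribute("innerHTML")` and the element's `tag_name`.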

answered Jul 16 '16 at 13:23

JeffC

Thank you. How do I specify a table name with spaces? – Alex Martian Jul 16 '16 at 14:04
It sounds like you are asking a new question. If not, please clarify what you are asking. – JeffC Jul 17 '16 at 04:35
I mean if it is not class="zebra" but, e.g., class="zebra ver2". – Alex Martian Jul 18 '16 at 12:47
CSS selectors are the way to go there. The basic format is `tag.class1.class2`, e.g. `table.zebra.ver2`. Check out these resources: [CSS Selector Reference](https://www.w3.org/TR/selectors/#selectors) and [CSS Selector Tips](https://saucelabs.com/resources/selenium/css-selectors). – JeffC Jul 18 '16 at 13:27
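To see how a chained class selector behaves, here is a small sketch that uses BeautifulSoup's `select` as a browser-free stand-in (assuming `bs4` is installed; the HTML string is made up for illustration). A value like class="zebra ver2" is two classes, so `table.zebra.ver2` matches it, and the same selector string should work in Selenium's CSS selector lookup.

```python
from bs4 import BeautifulSoup

# Made-up markup: one table with two classes, one with a single class.
html = (
    '<table class="zebra ver2"><tr><td>x</td></tr></table>'
    '<table class="zebra"><tr><td>y</td></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

both = soup.select("table.zebra")            # every table carrying class "zebra"
only_ver2 = soup.select("table.zebra.ver2")  # tables carrying both classes
cells = [td.get_text() for td in soup.select("table.zebra.ver2 td")]

print(len(both), len(only_ver2), cells)
```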
