
I was trying to parse table information listed on this site:

https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry

Here is the code I'm using:

import re
import urllib2
from bs4 import BeautifulSoup  # or: from BeautifulSoup import BeautifulSoup (the question's imports aren't shown)

# `row` comes from earlier parsing code (not shown)
link = re.findall(re.compile('<a href="(.*?)">'), str(row))
link = 'https://www.theice.com' + link[0]
print link  # double-check that the link is correct
user_agent = 'Mozilla/20.0.1 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
req = urllib2.Request(link, headers=headers)
try:
    pg = urllib2.urlopen(req).read()
    page = BeautifulSoup(pg)
except urllib2.HTTPError, e:
    print 'Error:', e.code, '\n', '\n'

table = page.find('table', attrs = {'class':'default'})
tr_odd = table.findAll('tr', attrs = {'class':'odd'})
tr_even = table.findAll('tr', attrs = {'class':'even'})
print tr_odd, tr_even

For some reason, during the `urllib2.urlopen(req).read()` step, the link changes, i.e., the page that gets downloaded is not the one at the URL given above. My program therefore opens a different page, and the variable `page` stores information from this new, different site. As a result, my `tr_odd` and `tr_even` variables are empty.

What could be the reason for the link changing? Is there another way to access the contents of this page? All I need are the table values.

James Hallen
  • What do you mean "the link changes"? Does the `link` variable change its value? How do you see this happening? Did you add another `print link` on the next line and see that it's different? – abarnert Jul 12 '13 at 00:49
  • I mean, the link is not the same. For example, if I copy that link (after the print statement) I can visit the correct website. But, when the program runs, it doesn't go to that same website. It's very strange. – James Hallen Jul 12 '13 at 00:52
  • The link is not the same as what? As itself? You're still not making yourself clear. – abarnert Jul 12 '13 at 00:55

2 Answers


The information on this page is supplied by a JavaScript function. When you download the page with urllib2, you get the HTML before the JavaScript has been executed. When you view the page manually in a standard browser, you see the HTML after the JavaScript has been executed.

To get at the data programmatically, you need to use some tool that can execute JavaScript. There are a number of third-party options available for Python, such as Selenium, WebKit, or SpiderMonkey.

Here is an example of how to scrape the page using Selenium (with PhantomJS) and lxml:

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH

link = 'https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry'

# contextlib.closing guarantees driver.close() is called when the block exits
with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)              # load the page and let its JavaScript run
    content = driver.page_source  # the HTML *after* JavaScript execution
    doc = LH.fromstring(content)
    tds = doc.xpath(
        '//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()')
    # group the flat list of cell texts into rows of 5 (explained below)
    print('\n'.join(map(str, zip(*[iter(tds)]*5))))

yields

('Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13')
('Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13')
('Sep13', '2/11/13', '9/27/13', '9/27/13', '9/27/13')
('Oct13', '2/11/13', '10/25/13', '10/25/13', '10/25/13')
...
('Aug18', '2/11/13', '8/31/18', '8/31/18', '8/31/18')
('Sep18', '2/11/13', '9/28/18', '9/28/18', '9/28/18')
('Oct18', '2/11/13', '10/26/18', '10/26/18', '10/26/18')
('Nov18', '2/11/13', '11/30/18', '11/30/18', '11/30/18')
('Dec18', '2/11/13', '12/28/18', '12/28/18', '12/28/18')

Explanation of the XPath:

lxml allows you to select tags using XPath. The XPath

'//table[@class="default"]//tr[@class="odd" or @class="even"]/td/text()'

means

//table    # search recursively for <table>
  [@class="default"]  # with an attribute class="default"
  //tr     # and find inside <table> all <tr> tags
    [@class="odd" or @class="even"]   # that have attribute class="odd" or class="even"
    /td      # find the <td> tags which are direct children of the <tr> tags  
      /text()  # return the text inside the <td> tag

Explanation of zip(*[iter(tds)]*5):

`tds` is a list. It looks something like

['Jul13', '2/11/13', '7/26/13', '7/26/13', '7/26/13', 'Aug13', '2/11/13', '8/30/13', '8/30/13', '8/30/13',...]

Notice that each row of the table consists of 5 items, but our list is flat. So, to group every 5 items together into a tuple, we can use the grouper recipe from the itertools documentation. zip(*[iter(tds)]*5) is an application of that recipe: it takes a flat list, like tds, and turns it into a list of tuples with every 5 items grouped together.

The way it works: `[iter(tds)]*5` is a list containing the same iterator five times, so `zip` pulls one item from each of the five references in turn, consuming five consecutive items of `tds` for every tuple it builds. If you have any questions about it, I'll be glad to try to answer.


To get just the first column of the table, change the XPath to:

tds = doc.xpath(
    '''//table[@class="default"]
         //tr[@class="odd" or @class="even"]
           /td[1]/text()''')
print(tds)

For example,

import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
link = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=6753474#expiry'
with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver:
    driver.get(link)
    content = driver.page_source
    doc = LH.fromstring(content)
    tds = doc.xpath(
        '''//table[@class="default"]
             //tr[@class="odd" or @class="even"]
               /td[1]/text()''')
    print(tds) 

yields

['Jul13', 'Aug13', 'Sep13', 'Oct13', 'Nov13', 'Dec13', 'Jan14', 'Feb14', 'Mar14', 'Apr14', 'May14', 'Jun14', 'Jul14', 'Aug14', 'Sep14', 'Oct14', 'Nov14', 'Dec14', 'Jan15', 'Feb15', 'Mar15', 'Apr15', 'May15', 'Jun15', 'Jul15', 'Aug15', 'Sep15', 'Oct15', 'Nov15', 'Dec15']

unutbu
  • If you don't mind, could you explain your last 2 lines of code? – James Hallen Jul 12 '13 at 01:43
  • Thanks for the great explanation, but suppose I was only interested in the first column, i.e.: `Jul13, Aug13, Sep13...`, how should I make changes `zip(*[iter(tds)]*1)`? – James Hallen Jul 12 '13 at 02:05
  • Also, after step `content = driver.page_source`, is it possible to parse it to `beautifulsoup`? – James Hallen Jul 12 '13 at 02:10
  • Yes, it is. (Sorry to have changed that on you; I just like lxml better.) – unutbu Jul 12 '13 at 02:11
  • No problem, the more the merrier, also I'm running Python 2.7 and I seem to get an error in the `with` statement: `File "C:\Python27\lib\site-packages\selenium-2.33.0-py2.7.egg\selenium\webdriver\phantomjs\service.py", line 63, in start raise WebDriverException("Unable to start phantomjs with ghostdriver.", e) WebDriverException: Message: 'Unable to start phantomjs with ghostdriver.' ; Screenshot: available via screen` – James Hallen Jul 12 '13 at 02:14
  • When you install phantomjs, you get an executable named `phantomjs`. You need to change the string in `webdriver.PhantomJS('phantomjs')` to the path to the phantomjs executable: `webdriver.PhantomJS('/path/to/executable')`. `phantomjs` gives you a headless browser. I use it so I don't have to see Firefox pop up. If you don't want to install phantomjs, you can use `webdriver.Ie()` to get Internet Explorer if you already have Internet Explorer installed. – unutbu Jul 12 '13 at 02:22
  • [The source code in service.py](http://selenium.googlecode.com/git/docs/api/py/_modules/selenium/webdriver/phantomjs/service.html) shows you get the error message `Unable to start phantomjs with ghostdriver` when an exception is raised on `subprocess.Popen(self.service_args,..)`. This line is starting the phantomjs binary. So it looks like you either [need to install phantomjs](http://phantomjs.org/download.html), or give a full path to the binary (as mentioned above) or just use `webdriver.Ie()` or `webdriver.Firefox()` or `webdriver.Chrome()` (if you have one of those browsers installed). – unutbu Jul 12 '13 at 02:35
  • Thanks for all the help, the program works perfectly, but I have a few questions. Sorry, I'm still a beginner at Python, so I will ask some stupid ones. 1) I don't understand your `with` command, how did this help with loading the JavaScript? 2) What was the exe for? It looks like every time I run a link, a black command box will open up. 3) Is there a fundamental difference between lxml and beautifulsoup, should I use one over the other? – James Hallen Jul 12 '13 at 02:38
  • The `with` statement sets up a [context manager](http://docs.python.org/2/reference/datamodel.html#context-managers). The reason why I chose to use one here is to guarantee that `driver.close()` is called when Python leaves the `with`-suite. Sure, you could forego the `with blahblahblah`, and just use `driver = webdriver.PhantomJS('phantomjs')`, but then you'd have to remember to call `driver.close()` yourself and make sure under all conditions (including exceptions) that Python calls `driver.close()`. Instead of being bothered with that, the "best practice" is to use a `with` statement. – unutbu Jul 12 '13 at 02:49
  • I'm not sure what you are talking about in the second question. I suppose you mean phantomjs.exe? It gives you a headless browser. What code are you using to "run a link"? Is the black command box a Windows console? If so, I'm afraid I won't be able to help you; it sounds like a Windows thing, and I'm using Ubuntu. – unutbu Jul 12 '13 at 02:54
  • lxml and beautifulsoup are both HTML/XML parsers. beautifulsoup is written largely in Python, while lxml is a wrapper around the C libraries libxml2 and libxslt. [`lxml` is much faster than `beautifulsoup`](http://www.crummy.com/2012/01/22/0). The advantage of `beautifulsoup` is that it has a more forgiving parser which can sometimes parse broken HTML. However, [beautifulsoup can use the lxml parser](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser), and [lxml can use the beautifulsoup parser](http://lxml.de/elementsoup.html). – unutbu Jul 12 '13 at 03:02
  • So if you install both, you can use either for parsing. Then it just becomes a choice between what API you like better for navigating HTML/XML. I prefer `lxml` because `XPath` is more compact than beautifulsoup's API. [XPath takes a while to learn](http://www.w3schools.com/xpath/xpath_syntax.asp) but in the end I think it is simpler than memorizing the multitude of methods Beautifulsoup makes available: findAll ,findPrevious, findNext, nextSibling, previousSibling, nextElement, find_parents, find_next_sibling, find_previous_siblings, find_all_next, ... blech; it's just too much! – unutbu Jul 12 '13 at 03:09
  • There are other differences too. lxml handles namespaces, XSLT, SAX handlers, and DTD validation; BeautifulSoup (I think) does not. – unutbu Jul 12 '13 at 10:47
  • Thanks for all the explanation, could you provide the code for the `try`/`finally` clauses the `with` statement incorporates? – James Hallen Jul 12 '13 at 12:09
  • Something like this? `with ... as driver: BLOCK` = `driver.__enter__(); try: BLOCK; finally: driver.__exit__()` – James Hallen Jul 12 '13 at 12:14
  • Sure. The `try..finally` is the one [shown here](http://docs.python.org/2/library/contextlib.html#contextlib.closing); a minimal sketch of the equivalence appears after these comments. You may also find [this explanation (by Fredrik Lundh)](http://effbot.org/zone/python-with-statement.htm) useful. – unutbu Jul 12 '13 at 12:15
  • Sorry, just trying to clarify my understanding. In `with contextlib.closing('link') as driver`, the decorator is `contextmanager` which takes the `contextlib.closing()` function as an argument. And that function is defined with a `try` (which probably opens the link) and `finally` (which closes the link)? – James Hallen Jul 13 '13 at 01:37
  • I think what is confusing is that there are two ways to define a contextmanager. Any object with `__enter__` and `__exit__` methods is a contextmanager. But you can also define a contextmanager by writing a generator function and wrapping it with the `@contextmanager` decorator. ["At the point where the generator yields, the block nested in the with statement is executed."](http://docs.python.org/2/library/contextlib.html#contextlib.contextmanager). When Python leaves the `with`-block either normally or due to an exception, execution resumes at the `yield` expression. – unutbu Jul 13 '13 at 02:13
  • In `with contextlib.closing(webdriver.PhantomJS('phantomjs')) as driver`, `contextlib.closing` is a decorator which takes `webdriver.PhantomJS('phantomjs')` as an argument. `contextlib.closing` returns a `contextmanager` -- i.e. an object with `__enter__` and `__exit__` methods. The actual definition of `contextlib.closing` can be [seen here](http://hg.python.org/cpython/file/86cc1983a94d/Lib/contextlib.py#l132). The documentation explains the behavior of `contextlib.closing` using a `try..finally` statement. But as you can see, the actual code just uses `__enter__` and `__exit__`. – unutbu Jul 13 '13 at 02:21
  • Sorry, but isn't `contextmanager` the decorator and `context.closing()` the wrapper and `webdriver.PhantomJS('phantomjs')` the object being passed? This stuff is very new/confusing for me! – James Hallen Jul 13 '13 at 02:41
  • Sorry, I've been typing the word wrong. A [context manager](http://docs.python.org/2/library/stdtypes.html#context-manager-types) (2 words) is an object with `__enter__` and `__exit__` methods. The [contextlib.contextmanager](http://docs.python.org/2/library/contextlib.html#contextlib.contextmanager) is a decorator which turns a generator function into a `context manager`. Above, everywhere I wrote `contextmanager` I meant `context manager`. When I wanted to refer to the decorator I wrote `@contextmanager`. – unutbu Jul 13 '13 at 09:50
  • I've been calling `contextlib.closing` a decorator. But on further reflection, I should really have been calling it something else, because it is not used to decorate functions. Instead, [`contextlib.closing` is a class](http://hg.python.org/cpython/file/86cc1983a94d/Lib/contextlib.py#l132). This class defines `__enter__` and `__exit__` methods. Instances of `contextlib.closing` are `context managers`. `webdriver.PhantomJS('phantomjs')` is the object being passed to `contextlib.closing`. Its `close` method is guaranteed to get called when Python leaves the `with`-suite. – unutbu Jul 13 '13 at 09:54
  • Thanks for all the clear up, looking at this link: `http://docs.python.org/2/library/contextlib.html#contextlib.closing`, in `def closing(thing):` is the `thing` `page` or is it `line`? So when we get to the step `yield thing`, what is being iterated through, the `line` right? – James Hallen Jul 13 '13 at 13:55
  • `thing` is `urllib.urlopen('http://www.python.org')`. After `yield thing`, `page` is assigned the value `urllib.urlopen('http://www.python.org')`. – unutbu Jul 13 '13 at 14:03
  • Thanks again for all the clear up. I'll need to spend another few days reviewing/googling to get a better understanding of generators/decorators/Context Managers. Going back to your code, the `tds = doc.xpath(...)` method doesn't work for this site specifically for link: `https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=106A2FB63BB67E82AC398396DF4F66F7?specId=6753474#expiry`. Do you know why that is? – James Hallen Jul 13 '13 at 15:50
  • Remove the `jsessionid`. Try: `link = 'https://www.theice.com/productguide/ProductSpec.shtml?specId=6753474#expiry'`. – unutbu Jul 13 '13 at 17:03
  • Hey, I tried it without the `jsessionid` and it still didn't work. – James Hallen Jul 13 '13 at 21:05
  • Hm, that's odd! I've edited my post to show the code I am using with this link as the result. – unutbu Jul 13 '13 at 21:25
  • Hey, if you don't mind, could you check this question please: `http://stackoverflow.com/questions/17953580/parsing-with-lxml-xpath` – James Hallen Jul 31 '13 at 01:42
  • Hey, I just installed Ubuntu on my computer, I can get the code above to run properly, the only difference is that you don't mention the path of the `phantomjs` in your code while I do: `/usr/local/bin/phantomjs`. What can I do to change this? – James Hallen Aug 08 '13 at 16:08
  • By default, `/usr/local/bin` should be in your `PATH` environment variable. If that is true, you should be able to use the string `'phantomjs'` instead of the full path. You can check if `/usr/local/bin` is in your `PATH` by typing `echo $PATH` in a terminal, or by simply typing `phantomjs` and seeing if you get a `phantomjs>` prompt. If `/usr/local/bin` is not in your `PATH`, see [this page](https://help.ubuntu.com/community/EnvironmentVariables) for info on setting "System-wide environment variables". Note: `/usr/local/bin` should already be listed in `/etc/environment`. – unutbu Aug 08 '13 at 17:01

I don't think the link is actually changing.

Anyway, the problem is that your regex is wrong. If you take the link it prints out and paste it into a browser, you get a blank page, or the wrong page, or a redirect to the wrong page. And Python is going to download the exact same thing.

Here's a link from the actual page:

<a href="/productguide/MarginRates.shtml;jsessionid=B53D8EF107AAC5F37F0ADF627B843B58?index=&amp;specId=19118104" class="marginrates"></a>

Here's what your regex finds:

/productguide/MarginRates.shtml;jsessionid=B53D8EF107AAC5F37F0ADF627B843B58?index=&amp;specId=19118104

Notice that `&amp;` there? You need to decode it to `&` or your URL is wrong. Instead of having a query-string variable `specId` with value `19118104`, you've got a query-string variable `amp;specId` (although technically, you can't have unescaped semicolons like that either, so everything from `jsessionid` on is a fragment).

You'll notice that if you paste the first one into a browser, you get a blank page. If you remove the extra `amp;`, then you get the right page (after a redirect). And the same is true in Python.

abarnert
  • Hi, if you run my code, and execute the `print` statement, you will see this link: `https://www.theice.com/productguide/ProductSpec.shtml;jsessionid=7A651D7E9437F76904BEC5623DBAB055?specId=19118104#expiry`. Which is the correct website. However, there are 2 similar sites: `https://www.theice.com/productguide/ProductSpec.shtml?specId=19118104#data` and `https://www.theice.com/productguide/ProductSpec.shtml?specId=19118104#` I think my program automatically goes to the last link. – James Hallen Jul 12 '13 at 01:08
  • I don't have your code, only the fragment of it that you provided, so I can't run it. Meanwhile, if I download the page manually and run your regexp against it, I get all kinds of things like `#" class="btn`, `/changePassword" style="float:left;clear:left;`, `/publicdocs/futures_us/ICE_Monthly_Softs_Fast_Facts.pdf" target="_blank`, etc. I can't guess which one of these is "the link" that you're expecting to work unless you tell me. – abarnert Jul 12 '13 at 01:18
  • Actually… the first result is `/`. So, if that's a fragment of your real code (why would you use `findall` just to throw away all but the first one?), I _can_ guess the link. It's just `/`. Which is nothing like what you say it is. – abarnert Jul 12 '13 at 01:19