Python: why site is not parsing?

Question

I ran this code against the website juventus.com, and I can parse the title:

from urllib import urlopen
import re

webpage = urlopen('http://juventus.com').read()
patFinderTitle = re.compile('<title>(.*)</title>')
findPatTitle = re.findall(patFinderTitle, webpage)
print findPatTitle

The output is:

['Welcome - Juventus.com']

But if I try the same code on another website, it returns nothing:

from urllib import urlopen
import re

webpage = urlopen('http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq').read()
patFinderTitle = re.compile('<title>(.*)</title>')
findPatTitle = re.findall(patFinderTitle, webpage)
print findPatTitle

Does anyone know why?

The page is redirected to another one.. are you following the redirect? — msturdy, Jul 25 '13 at 17:06
I recommend caching the page and checking the saved HTML file to see whether it is the page you want. I noticed it needs authentication, but that won't be a problem because the page has a title. Cache it like this: `file("cached.html", "w").write(webpage)` — AliBZ, Jul 25 '13 at 17:08
@FillethackerRanjid `urllib.urlopen` doesn't follow redirects - try using `urllib2.urlopen`. Also, you may wish to consider `BeautifulSoup` for parsing HTML instead of regular expressions, and the `requests` library is great and makes HTTP requests easier to understand... — Jon Clements, Jul 25 '13 at 17:08
use python to display what the page is returning: `print webpage`.. you'll see that it is being redirected to another page with javascript.. then maybe you can parse that link out and follow it? — msturdy, Jul 25 '13 at 17:09
[Please don't try to parse HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags). [Use an HTML Parser instead](http://stackoverflow.com/questions/11709079/parsing-html-python). — thegrinner, Jul 25 '13 at 17:16

falsetru · Accepted Answer · 2013-07-25T17:50:20.103

4

The content of http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq is (reformatted to make it easier to read):

<script type='text/javascript'>
top.location.href = 'https://www.facebook.com/dialog/oauth?
client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&
state=07c9ba739d9340de596f64ae21754376&scope=email&0=publish_actions';
</script>

There's no title tag; no regular expression match.
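
To check this without touching the network, here is a minimal sketch (written in Python 3 syntax, unlike the Python 2 code in the question). It runs the question's title regex against the script body shown above and then pulls out the redirect target, along the lines msturdy suggested in the comments. The variable names and the `top.location.href` pattern are assumptions that fit this particular page only, not a general way to follow JavaScript redirects.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Body returned by the server, copied from the snippet above and joined
# back onto a single line.
body = (
    "<script type='text/javascript'>"
    "top.location.href = 'https://www.facebook.com/dialog/oauth?"
    "client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&"
    "state=07c9ba739d9340de596f64ae21754376&scope=email&0=publish_actions';"
    "</script>"
)

# The question's pattern: there is no <title> element, so findall
# returns an empty list.
titles = re.findall(r"<title>(.*)</title>", body)

# Hypothetical pattern for this page's JavaScript redirect (an assumption,
# it only handles a single-quoted top.location.href assignment).
redirect = re.search(r"top\.location\.href\s*=\s*'([^']+)'", body)
target = redirect.group(1) if redirect else None

print(titles)
print(target)
if target:
    parts = urlsplit(target)
    print(parts.netloc)
    print(parse_qs(parts.query)["redirect_uri"])
```

Fetching `target` with a plain HTTP client would most likely just return Facebook's login dialog, so extracting the link does not by itself get you the page title you are after.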

Use selenium to evaluate javascript:

from selenium import webdriver

driver = webdriver.Firefox() # webdriver.PhantomJS()
driver.get('http://bp1.shoguto.com/detail.php?userg=hhchpxqhacciliq')
print driver.title
driver.quit()

edited Jul 25 '13 at 17:50

answered Jul 25 '13 at 17:08

falsetru


falsetru is right, just disable your browser's javascript and check the site again. – AliBZ Jul 25 '13 at 17:09
OK, I see the same on my screen now. But is it possible to parse the site I want to parse? – Fillethacker Ranjid Jul 25 '13 at 17:17
@FillethackerRanjid, using selenium will give you the expected result. – falsetru Jul 25 '13 at 17:17
Can't I do that with Python? – Fillethacker Ranjid Jul 25 '13 at 17:19
If you want to evaluate the javascript, I'd suggest using Selenium. Then, like Marcin said, if you want to parse html, BeautifulSoup is a great way to go. If all you want to check is the title tag, you can probably get away with using a regex. [Here](http://stackoverflow.com/questions/17768460/mechanize-not-showing-fb-messages-form/17769190#17769190) is some information on using Selenium, and [here](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is some on BeautifulSoup. – Matthew Wesly Jul 25 '13 at 17:20
@FillethackerRanjid, I added selenium code. [Accept the answer](http://stackoverflow.com/help/accepted-answer) if my answer was helpful :) – falsetru Jul 25 '13 at 17:28
How can I accept your answer? Is it accepted now? But the code gives me no output!! – Fillethacker Ranjid Jul 25 '13 at 17:38
It's opening a browser with the Facebook login page. – Fillethacker Ranjid Jul 25 '13 at 17:48
@FillethackerRanjid, Yes, selenium uses an actual browser (here Firefox). You can use [PhantomJS](http://phantomjs.org/) if you don't want a window to pop up. – falsetru Jul 25 '13 at 17:49

score 0 · Answer 2 · answered Jul 25 '13 at 17:09

Because the request is redirected, and the regex does not match a title tag on the page it is redirected to.

Your code should (a) use BeautifulSoup to parse the HTML instead of regexes, or lxml if you know the output will be well-formed XML (or lxml with the BeautifulSoup backend), and (b) use requests, a simpler module for making HTTP requests, which can handle redirects transparently.
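
As a rough illustration of point (a), here is a stdlib-only sketch in Python 3 that uses `html.parser.HTMLParser` in place of BeautifulSoup or lxml, swapped in only so the example has no third-party dependency. `TitleParser` and `extract_title` are hypothetical names, and the sample strings are made up for the demonstration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None  # stays None if no <title> is ever seen

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> is handled too.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def extract_title(html):
    """Return the stripped title text, or None if there is no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title is not None else None


# Made-up inputs: an upper-case, multi-line title and a script-only page.
with_title = "<html><head><TITLE>\n Welcome - Juventus.com </TITLE></head></html>"
script_only = "<script type='text/javascript'>top.location.href = 'x';</script>"

print(extract_title(with_title))
print(extract_title(script_only))
```

A parser copes with the upper-case, multi-line title that the question's regex would miss, but it still finds nothing on a page that consists only of a script, so it does not by itself solve the redirect.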

score 0 · Answer 3 · answered Jul 25 '13 at 17:14

That's because the page returned by urlopen contains only a JavaScript redirect; it doesn't contain a title tag.

This is what it contains:

<script type='text/javascript'>top.location.href = 'https://www.facebook.com/dialog/oauth?client_id=466261910087459&redirect_uri=http%3A%2F%2Fbp1.shoguto.com&state=0f9abed6de7412b5129a4d105a4be25f&scope=email&0=publish_actions';</script>

Also, I may be wrong, but as far as I recall you can't use urlopen to run JavaScript code. You will need a different Python module. I can't remember its name right now, but I believe there is one that can run the JavaScript, although it needs a GUI and a real browser such as Firefox.
