
I am trying to fetch a web page with urllib2 and then extract parts of it with a regex.

Here is my code:

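import re
import urllib2

urlOpener = urllib2.build_opener()  # assumed setup; not shown in the post

def Get(url):
    """Fetch url and return the response body as a string."""
    request = urllib2.Request(url)
    page = urlOpener.open(request)
    return page.read()

page = Get(myurl)
#page = "<html>.....</html>" #local string for test
pattern = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
for task in pattern.findall(taskListPage):
    pass  # loop body not shown in the post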

If I test with a local string (the same text that Get(myurl) returns), the pattern matches, but if I run it on the result of Get(myurl) directly, the pattern does not match.

I will be grateful if someone can tell me why.

pavium
user299654
  • Please read the top answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Jim Garrison Jul 04 '11 at 02:39
  • As Jim's comment points out, trying to parse HTML with regex will eventually drive you to the brink of insanity. I suggest using a [more powerful parser](http://www.crummy.com/software/BeautifulSoup/) that can handle non-regular languages better and is more resilient to minor page modifications. – sarnold Jul 04 '11 at 02:48
  • See also: http://stackoverflow.com/questions/6556141/regex-to-extract-favicon-url-from-a-webpage/6556360#6556360 – johnsyweb Jul 04 '11 at 03:05
  • Not worth asking: it has been discussed numerous times that regexes are not the appropriate choice for parsing HTML. Do your research first. –  Jul 04 '11 at 03:50
  • @user299654 Please explain your problem in more detail; I don't understand it as stated. What do you mean by _"If I use ... , but if I use ..."_: what action is the verb 'use' supposed to describe? What is **taskListPage**? Why obtain **page** with ``page = Get(myurl)`` if it isn't used after that statement? By the way, ``re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)`` is a regex (a RegexObject, more exactly). – eyquem Jul 04 '11 at 11:32

1 Answer


Valid reservations about using regex on HTML aside, try this regex instead:

(<tr>\s*<td height="25.*?</tr>)

Your pattern matched only where the $ anchors lined up (with re.M, just before a newline or at the end of the input), and the anchored terms at the front of the regex caused problems too.
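One plausible way the anchors can bite, sketched under assumptions: the question does not show the fetched page, so the sample rows below are hypothetical, and the idea that the server sends CRLF line endings is only a guess. In Python's re module, $ with re.M matches just before "\n" (or at the end of the string), so a "\r" sitting right after <tr> is enough to defeat the original pattern while the relaxed one still matches.

```python
import re

# Hypothetical sample rows; the real page content is not shown in the question.
ROWS = ('<tr>{nl}<td height="25">task A</td>{nl}</tr>{nl}'
        '<tr>{nl}<td height="25">task B</td>{nl}</tr>{nl}')
local_page = ROWS.format(nl="\n")      # hand-typed test string: LF newlines
fetched_page = ROWS.format(nl="\r\n")  # assumed server response: CRLF newlines

# Pattern from the question and the relaxed pattern from this answer.
original = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$',
                      re.M | re.I | re.DOTALL)
relaxed = re.compile(r'(<tr>\s*<td height="25.*?</tr>)', re.I | re.DOTALL)

local_hits = original.findall(local_page)
fetched_hits = original.findall(fetched_page)
relaxed_hits = relaxed.findall(fetched_page)

print(len(local_hits))    # 2: '$' finds a "\n" right after <tr> and </tr>
print(len(fetched_hits))  # 0: '$' cannot match before "\r"
print(len(relaxed_hits))  # 2: no anchors, and \s* absorbs "\r\n"
```

If the fetched page differs from the test string in some other way (indentation, trailing spaces, attributes on <tr>), the same reasoning applies: the fewer anchors the pattern has, the fewer exact-layout assumptions it makes.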

This match is brittle - let's hope the web guy doesn't change the height of the rows...
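On the brittleness point, here is a minimal sketch of the parser route the comments on the question recommend. It swaps in the standard library's HTMLParser rather than BeautifulSoup, the RowCollector class and the sample markup are hypothetical, and it still keys on height="25", so it only removes the dependence on whitespace, line endings, and tag case, not on the row height.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2, which the question uses


class RowCollector(HTMLParser):
    """Collect text from each row, starting at its first <td height="25">."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.rows = []      # finished rows, one string each
        self._cells = None  # text pieces of the row being read, or None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from the parser.
        if tag == "td" and self._cells is None and dict(attrs).get("height") == "25":
            self._cells = []

    def handle_data(self, data):
        if self._cells is not None and data.strip():
            self._cells.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "tr" and self._cells is not None:
            self.rows.append(" ".join(self._cells))
            self._cells = None


# Hypothetical markup mixing CRLF line endings and upper-case tags.
sample = ('<table>\r\n<tr>\r\n<td height="25">task A</td>\r\n</tr>\r\n'
          '<TR><TD HEIGHT="25">task B</TD></TR>\r\n</table>')
collector = RowCollector()
collector.feed(sample)
collector.close()
print(collector.rows)  # ['task A', 'task B']
```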

Bohemian