2

Is there a more succinct/correct/pythonic way to do the following:

url = "http://0.0.0.0:3000/authenticate/login"
re_token = re.compile("<[^>]*authenticity_token[^>]*value=\"([^\"]*)")
for line in urllib2.urlopen(url):
    if re_token.match(line):
        token = re_token.findall(line)[0]
        break

I want to get the value of the input tag named "authenticity_token" from an HTML page:

<input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />
aaronstacy
  • 2
    The proper way to do this is to use an HTML parser like BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/. See here for the reason: http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-rege – Ayman Hourieh Nov 08 '09 at 23:10
  • 2
    Regexes shouldn't be used with HTML/XML -- too many ways for things to break. Look at BeautifulSoup or one of the HTML parser modules. – Brian C. Lane Nov 08 '09 at 23:11
  • You need to use a parser like BeautifulSoup. What if you use a regex and a malicious user works out a way to put some text that matches the regex somewhere on the page eg. in a comment or something? You end up thinking that that is the authenticity_token, which is asking for trouble. – John La Rooy Nov 09 '09 at 00:31

4 Answers

6

Could you use Beautiful Soup for this? The code would essentially look something like so:

import urllib2
from BeautifulSoup import BeautifulSoup

url = "http://0.0.0.0:3000/authenticate/login"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)
# find() returns the matching tag; index it to read the value attribute
token = soup.find("input", {'name': 'authenticity_token'})['value']

Something like that should work. I didn't test this but you can read the documentation to get it exact.
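
If BeautifulSoup isn't available, the same parser-based idea can be sketched with the standard library alone. This is only an illustration, assuming Python 3's html.parser module; the names TokenFinder and find_token are made up here, and the HTML is passed in as a string rather than fetched from the URL.

```python
from html.parser import HTMLParser


class TokenFinder(HTMLParser):
    """Remembers the value of the first <input name="authenticity_token">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already stripped
        attributes = dict(attrs)
        if (self.token is None and tag == "input"
                and attributes.get("name") == "authenticity_token"):
            self.token = attributes.get("value")


def find_token(html):
    """Return the token value found in html, or None if there is none."""
    finder = TokenFinder()
    finder.feed(html)
    finder.close()
    return finder.token


# Sample markup built around the input tag from the question
sample = ('<form><input name="authenticity_token" type="hidden" '
          'value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" /></form>')
print(find_token(sample))
```

Because the parser hands over attributes as name/value pairs, the lookup does not depend on attribute order or quoting style.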

Bartek
1

You don't need the findall call. Instead use:

m = re_token.match(line)
if m:
    token = m.group(1)
    ....

I second the recommendation of BeautifulSoup over regular expressions though.
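
One more detail worth checking in the original loop: match() only matches at the start of the string, so a line where the tag is indented or preceded by other markup is skipped, whereas search() scans the whole line. A small sketch of the difference, assuming Python 3; the sample line and its value abc123 are made up for illustration:

```python
import re

re_token = re.compile(r'<[^>]*authenticity_token[^>]*value="([^"]*)')

# Hypothetical line from the page: the tag is indented, as it often is in real HTML
line = '    <input name="authenticity_token" type="hidden" value="abc123" />'

print(re_token.match(line))   # None: match() is anchored at position 0

m = re_token.search(line)     # search() looks anywhere in the line
if m:
    token = m.group(1)
    print(token)
```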

interjay
1

There's nothing "pythonic" about using regex. If you don't want to use BeautifulSoup (which you ideally should), just use Python's excellent string manipulation capabilities:

for line in open("file"):
    line=line.strip()
    if "<input name" in line and "value=" in line:
        item=line.split()
        for i in item:
            if "value" in i:
                print i

output

$ more file
<input name="authenticity_token" type="hidden" value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4=" />
$ python script.py
value="WTumSWohmrxcoiDtgpPRcxUMh/D9m7O7T6HOhWH+Yw4="
ghostdog74
  • This code is terrible... worse than the original IMHO (though of course an actual parser like BS is the way to go). You should almost never have quad nested statements like this. The original had two, and you doubled it. – Andrew Johnson Nov 09 '09 at 01:56
  • And you introduced a bunch of random string literals. – Andrew Johnson Nov 09 '09 at 01:57
  • you should take a look at my output before you comment. I am doing it on a file with only that sample line OP posted, just to show you can just use Python's internal string capabilities without too much regex. What quad nested statements and random string literals are you talking about? If you have a better solution, then please post it out. – ghostdog74 Nov 09 '09 at 02:37
  • Your code nests for->if->for->if, and is indented four times. The string literals are " – Andrew Johnson Nov 09 '09 at 03:30
  • So what if it's indented four times? The first if tests for the "almost" exact line to get. Then, once the line is grabbed, split it into items and iterate over them to get "value" (because we don't know where value might be). There's no use of regex in this case. What's wrong with that? Like I already said, OP should use BS if possible, but my solution also applies when one doesn't want to use BS. – ghostdog74 Nov 09 '09 at 04:21
0

As to why you shouldn't use regular expressions to search HTML, there are two main reasons.

The first is that HTML is defined recursively, and regular expressions, which compile into stackless state machines, don't do recursion. You can't write a regular expression that can tell, when it encounters an end tag, which of the start tags it passed along the way that end tag belongs to; there's nowhere to save that information.
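
A quick way to see the problem is to try pairing up tags with Python's re module. In this sketch (Python 3 assumed; the <div> snippets are made up for illustration), the lazy pattern picks the wrong end tag on nested input, and the greedy pattern picks the wrong one on sibling input, so neither handles both:

```python
import re

nested = "<div>a<div>b</div>c</div>"
siblings = "<div>a</div><div>b</div>"

lazy = re.compile(r"<div>(.*?)</div>")
greedy = re.compile(r"<div>(.*)</div>")

# Lazy stops at the first </div>, which closes the inner tag, not the outer one
lazy_nested = lazy.search(nested).group(1)
# Greedy runs to the last </div>, swallowing the second sibling as well
greedy_siblings = greedy.search(siblings).group(1)

print(lazy_nested)       # a<div>b
print(greedy_siblings)   # a</div><div>b
```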

The second is that parsing HTML (which BeautifulSoup does) normalizes all kinds of things that are allowable in HTML and that you're probably not going to ever consider in your regular expressions. To pick a trivial example, what you're trying to parse:

<input name="authenticity_token" type="hidden" value="xxx"/>

could just as easily be:

<input name='authenticity_token' type="hidden" value="xxx"/>

or

<input type = "hidden" value = "xxx" name = 'authenticity_token' />

or any one of a hundred other permutations that I'm not thinking about right now.
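
To make that concrete, here is a small sketch that runs the question's regex and a parser over the three variants above. It assumes Python 3 and swaps in the standard library's html.parser for BeautifulSoup so that it runs without extra packages; the class and function names are made up for illustration.

```python
import re
from html.parser import HTMLParser

# The pattern from the question
re_token = re.compile(r'<[^>]*authenticity_token[^>]*value="([^"]*)')

variants = [
    '<input name="authenticity_token" type="hidden" value="xxx"/>',
    '<input name=\'authenticity_token\' type="hidden" value="xxx"/>',
    '<input type = "hidden" value = "xxx" name = \'authenticity_token\' />',
]


class TokenFinder(HTMLParser):
    """Records the value attribute of <input name="authenticity_token">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "authenticity_token":
            self.token = attributes.get("value")


def token_by_regex(html):
    m = re_token.search(html)
    return m.group(1) if m else None


def token_by_parser(html):
    finder = TokenFinder()
    finder.feed(html)
    finder.close()
    return finder.token


regex_results = [token_by_regex(v) for v in variants]
parser_results = [token_by_parser(v) for v in variants]
print(regex_results)
print(parser_results)
```

The regex finds the token in the first two variants but misses the third, where value comes before name and there are spaces around the equals signs. The parser returns the value for all three.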

Robert Rossney