Python regex problem

Question

I'm trying to match some HTML with a regex, and the regex works fine like this:

import re

reg = r";!--\"\'<[a-i0-9]{8}>=&\{\(\)\}"

html_data = "some html data"

if re.search(reg, html_data):
    print("Match")

But if it gets the HTML data either by reading a local file or by fetching it from the web, it fails. I've downloaded the HTML page from the server, then copy-pasted the source into the script, and it works fine. But reading directly from the file or the server does not work.

I've also checked the local file with a hex editor to verify that there isn't some special char that is screwing me over.

Example of string to be matched:

<input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">

Where ;!--\"\'<a41cgb04>=&{()} is the part that should be matched.

Don't use regex to parse HTML! I'm certain Python has a module for parsing it correctly. — Cfreak, Aug 26 '11 at 14:26
*I'm trying to match some html with a regex... - Said the novice programmer unsuspecting of what horror lies beyond his statement.* There should be a horror movie about parsing HTML with regex. — Alin Purcaru, Aug 26 '11 at 14:30
Badly formulated by me; I'm matching a string that can appear anywhere in the source, so I can't really use a parser to find it. — Sindre Smistad, Aug 26 '11 at 14:35
Can you post the rest of your code where you're reading in from a file? — , Aug 26 '11 at 14:37
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — andrew cooke, Aug 26 '11 at 16:30
Could you give the address of a web page on which you get the problem, please? I want to know how you obtain the sequence of characters you show, so that I can reproduce the problem and analyze it more precisely. — eyquem, Aug 26 '11 at 17:11
Wrong! [Regexes](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) [work](http://stackoverflow.com/questions/7181653/regex-matching-tag-names-only-in-html/7182478#7182478) [fine](http://stackoverflow.com/questions/4044946/regex-to-split-html-tags/4045840#4045840) [on](http://stackoverflow.com/questions/4031112/regular-expression-matching/4034386#4034386) [HTML](http://stackoverflow.com/questions/4840988/the-recognizing-power-of-modern-regexes/4843579#4843579). — tchrist, Aug 26 '11 at 17:35

score 2 · Accepted Answer · edited Jun 20 '20 at 09:12

As I see it, your problem comes from a misreading of this:

<input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">

You think that the backslashes in front of " and ' are in the source code. But I think that at least one of the two is in fact an artefact of the display: it is not actually present in the HTML code.

I don't know how you obtained the above sequence of characters.
But I think the phenomenon is the same as the one seen with repr():
the display adds backslashes to make clear what the string contains, but not all of those backslashes are in the value of the string being displayed.

You'll better understand what I mean with this:

a = "abc ' def "

b = ' ABC " DEF'

print repr(a + b)

result

'abc \' def  ABC " DEF'

.

Update

Take the following web page as an example:

http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/

.

Viewing the source of this page in the browser shows this as the 13th line:

<meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia" />

Now, executing the following code

from urllib import urlopen


url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'

sock = urlopen(url)
srce = sock.read()
sock.close()


li = srce.splitlines(True)

print 'Displayed normally:\n-------------------\n'
print '\n'.join(li[12:14])
print

print 'Displayed with repr():\n----------------------\n'
print '\n'.join(map(repr,li[12:14]))
print

print 'Displayed in a list:\n--------------------\n'
print li[12:14]

produces the result:

Displayed normally:
-------------------

<meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in  Bergenia" />

<meta name="allow-search" content="YES" />


Displayed with repr():
----------------------

'<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in  Bergenia" />\n'
'<meta name="allow-search" content="YES" />\n'

Displayed in a list:
--------------------

['<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in  Bergenia" />\n', '<meta name="allow-search" content="YES" />\n']

Displaying the source code normally has a drawback: special characters such as '\n', '\r' and '\t' are not visible, which makes it hard to write a regex pattern.
That is why it is easier to analyze an HTML source when the strings are displayed without interpretation.

So, displaying the source code with repr() or in a list shows all the characters explicitly.
The only drawback is that ' characters in the middle of a string are sometimes escaped, because that is how they must be written in a string literal delimited by ' quotes. When a list is displayed, its elements are shown using repr(); that is why the instruction print li[12:14] displays the elements in the same form as the instruction print '\n'.join(map(repr,li[12:14])). In fact, repr() displays a string the way it would have to be written in source code to produce that value.

.

In the end, what I want to underline is this: if someone writes a regex pattern with "\\\\'" or r"\\'" because the repr() display of the source led them to believe there is a \ character before a ' character, the pattern is incorrect.

The following code explains this better, I hope:

import re
from urllib import urlopen


url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'

sock = urlopen(url)
srce = sock.read()
sock.close()


pat = '<meta name="abstract" content="(Heronswood Bergenia (\'Lunar Glow\')? [a-zA-Z]+\d+ .*?)" />'
regx = re.compile(pat)
print regx.search(srce).groups()

pat = "<meta name=\"abstract\" content=\"(Heronswood Bergenia (\\\\'Lunar Glow\\\\')? [a-zA-Z]+\d+ .*?)\" />"
regx = re.compile(pat)
print regx.search(srce).groups()

result

("Heronswood Bergenia 'Lunar Glow' PP20247 in  Bergenia", "'Lunar Glow'")

Traceback (most recent call last):
  File "I:\trez.py", line 18, in <module>
    print regx.search(srce).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
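
The same point can be checked offline. What follows is only a minimal sketch, written so that it also runs on Python 3 (the code above is Python 2). The string `line` is a hand-typed stand-in for the meta line, not data downloaded from the site, and the variable names are made up for the illustration.

```python
import re

# Hand-typed stand-in for the fetched line (an assumption, not real
# downloaded data). It contains apostrophes but no backslash at all.
line = "<meta name=\"abstract\" content=\"Heronswood Bergenia 'Lunar Glow' PP20247\" />"

# repr() wraps the value in single quotes, so it has to escape the
# apostrophes inside it. The backslashes exist only in this display.
shown = repr(line)
print(shown)

has_backslash_in_value = "\\" in line   # the real string
has_backslash_in_repr = "\\" in shown   # the display of the string

# Pattern written for the real value: plain apostrophes.
good = re.search(r"Bergenia ('Lunar Glow')", line)

# Pattern written from the repr() display: expects a literal backslash
# before each apostrophe, which the real value does not contain.
bad = re.search(r"Bergenia (\\'Lunar Glow\\')", line)

print(has_backslash_in_value, has_backslash_in_repr)
print(good is not None, bad is not None)
```

Checking `"\\" in line` before writing the pattern is a quick way to find out whether a backslash seen on screen is really part of the data.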

score 0 · Answer 2 · answered Aug 26 '11 at 14:28

Perhaps http://docs.python.org/library/htmlparser.html will be more useful to you than trying to use a regex. I tend to agree with Mark Pilgrim that using a regex gives you two problems: the regex and your original issue.
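
As a rough sketch of what that could look like: the example below uses `html.parser.HTMLParser`, which is the Python 3 name of the linked module. The class name `AttributeCollector`, the sample `page` and the `token-` prefix are all invented for the illustration; the input is deliberately simple and well-formed, unlike the string in the question.

```python
import re
from html.parser import HTMLParser  # Python 3 location of the linked module


class AttributeCollector(HTMLParser):
    """Hypothetical helper: collect every attribute value from start tags."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None for
        # attributes written without a value.
        self.values.extend(value for _name, value in attrs if value is not None)


# Made-up, well-formed input, for illustration only.
page = '<form><input type="text" value="token-a41cgb04" name="url"></form>'

collector = AttributeCollector()
collector.feed(page)
collector.close()

# Search the extracted attribute values instead of the raw source.
token = re.compile(r"token-[a-i0-9]{8}")
hits = [value for value in collector.values if token.search(value)]
print(hits)
```

This only looks at attribute values. A string that can turn up anywhere in the source would still need a search over the raw text.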

Boogle


I'm not parsing html in that sense, I'm looking for ;!--\"\'<8randomletters>=&{()} and it can appear anywhere in the source. – Sindre Smistad Aug 26 '11 at 14:33
Can't you use the html parser to bring in the source as strings and then search those for the string using data.find()? – Boogle Aug 26 '11 at 14:48
HTML parsers are not the solution to all data munging needs. It is harmful and unrealistic to pretend otherwise. Regexes are made for this sort of thing. – tchrist Aug 26 '11 at 17:42

score 0 · Answer 3 · answered Aug 26 '11 at 14:32

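The backslash character, \, has a special meaning in regular expressions. If you want to match a backslash in the text, you have to write \\ in the regular expression:

reg = r''';!--\\"\\'<[a-i0-9]{8}>=&\{\(\)\}'''

(The pattern is in triple quotes because it contains both " and '. Inside a plain r"..." literal, the " after \\ would end the string early.)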
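
A small sketch of the difference, assuming the data read from the file really does contain the two backslashes. The names `from_file` and `pasted` are hypothetical stand-ins for the two situations described in the question. The patterns are written with triple quotes only because they contain both kinds of quote character.

```python
import re

# Assumption: the text read from the file contains real backslashes,
# i.e. the characters  ;!--\"\'<a41cgb04>=&{()}
from_file = r'''value=";!--\"\'<a41cgb04>=&{()}" name="url"'''

# The same text pasted into a normal "..." literal: Python turns \" into "
# and \' into ', so no backslash survives in the resulting string.
pasted = ";!--\"\'<a41cgb04>=&{()}"

# Pattern from the question: in a regex, \" just means the character ".
original = re.compile(r''';!--\"\'<[a-i0-9]{8}>=&\{\(\)\}''')

# Pattern with \\ : matches a literal backslash before " and before '.
escaped = re.compile(r''';!--\\"\\'<[a-i0-9]{8}>=&\{\(\)\}''')

print(original.search(pasted) is not None)     # pasted text, original pattern
print(original.search(from_file) is not None)  # file text, original pattern
print(escaped.search(from_file) is not None)   # file text, escaped pattern
```

If the data holds real backslashes, the original pattern would match only the pasted copy, which is the behaviour the question reports.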

Jason Orendorff


reg = r";!--\"\'<[a-i0-9]{8}>=&\{\(\)\}" is the same as reg = ";!--\\"\\'<[a-i0-9]{8}>=&\{\(\)\}" – Sindre Smistad Aug 26 '11 at 14:38
@pyCtl_ Yes, but I think you still need two backslashes, or else the regex compiler will eat them. Hm, do you *really* have backslashed double quotes in the doublequoted value of that input widget? – tchrist Aug 26 '11 at 17:41

score 0 · Answer 4 · answered Aug 26 '11 at 16:07

I'd change your regex since you are in backslash hell. This expression works when the data is read from a file.

reg = ";!--....<[a-i0-9]{8}>=&\{\(\)\}"

In breaking your expression down into parts:

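reg = ";!--"  matches.
reg = ";!--\\" throws an error about a bogus end-of-line escape.

The regex compiler does not accept a pattern that ends in a lone backslash: the Python string ";!--\\" is the five characters ;!--\, and the final \ has nothing left to escape.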
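
Both observations can be reproduced with a short sketch. The `sample` string is an assumption: it stands for file content that really contains the two backslashes. The exact wording of the error message differs between Python versions, so it is printed rather than compared.

```python
import re

# The Python string below is valid; it holds the five characters ;!--\
trailing = ";!--\\"

# As a regex, though, it ends in a lone backslash with nothing to escape,
# so compiling it raises re.error.
try:
    re.compile(trailing)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("re.error:", exc)

# The dotted version avoids the escaping problem: the four dots stand in
# for the four characters \ " \ ' without naming them.
loose = re.compile(r";!--....<[a-i0-9]{8}>=&\{\(\)\}")

# Assumed file content, with real backslashes in it.
sample = r''';!--\"\'<a41cgb04>=&{()}'''
found = loose.search(sample)
print(found is not None)
```

The trade-off is that the four dots accept any four characters at that position, so the pattern is looser than one that spells them out.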

As the saying goes:
A developer has a problem and thinks "I'll solve it with regular expressions".

Now the developer has two problems.

edited Aug 26 '11 at 17:10

answered Aug 26 '11 at 16:07

WombatPM


The only reason the developer makes two problems by using regexes is if they are using a tool whose settings they don’t understand. If they actually understand pattern matching, regexes are quite often the cleanest, easiest, clearest, and most maintainable approach to string hacking. Consider the pattern `foo(.*?)bar`, which captures that string into group 1. Now go off and write your tedious imperative version with indexing and substrings. The regex is a lot better, and more efficient, too. – tchrist Aug 26 '11 at 17:46
