I think your problem comes from a misreading of this:
<input type="text" value=";!--\"\'<a41cgb04>=&{()}" name="url" maxlength="200" class="url" style="width:495px;">
You think that the backslashes in front of " and ' are in the source code. But I think that one of the two is in fact an artefact of the display: it is not actually present in the HTML code.
I don't know how you obtained the above sequence of characters.
But I think the phenomenon is the same as the one observed when using repr():
the display contains backslashes that the displayer adds to make clear what is in the sequence of characters, but not all of these backslashes are part of the value of the string being displayed.
This example shows what I mean:
a = "abc ' def "
b = ' ABC " DEF'
print repr(a + b)
Result:
'abc \' def  ABC " DEF'
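To check this without trusting the eye, one can count characters. This is only a small sketch: the names s and shown are mine, and it uses print() calls so that it also runs on Python 3.

```python
a = "abc ' def "
b = ' ABC " DEF'
s = a + b

shown = repr(s)        # the text the interpreter displays for s

print(shown)           # the display contains one backslash...
print('\\' in shown)   # True
print('\\' in s)       # False: ...but the string itself has none
print(len(s))          # 20 characters in the real value
print(len(shown))      # 23 = 20 + two enclosing quotes + one backslash
```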
Update
Take the following web page as an example:
http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/
Viewing the source code of this page gives a display in which the 13th line is
<meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia" />
Now, executing the following code
from urllib import urlopen
url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'
sock = urlopen(url)
srce = sock.read()
sock.close()
li = srce.splitlines(True)
print 'Displayed normally:\n-------------------\n'
print '\n'.join(li[12:14])
print
print 'Displayed with repr():\n----------------------\n'
print '\n'.join(map(repr,li[12:14]))
print
print 'Displayed in a list:\n--------------------\n'
print li[12:14]
produces the result:
Displayed normally:
-------------------
<meta name="abstract" content="Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia" />
<meta name="allow-search" content="YES" />
Displayed with repr():
----------------------
'<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n'
'<meta name="allow-search" content="YES" />\n'
Displayed in a list:
--------------------
['<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n', '<meta name="allow-search" content="YES" />\n']
Displaying the source code normally has a drawback: special characters like '\n', '\r' and '\t' are not visible, which makes it hard to write a regex pattern.
That's why analyzing an HTML source is easier when the strings are displayed without interpretation.
Displaying the source code with repr() or in a list shows all the characters explicitly.
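A small sketch of that point. The chunk below is a made-up fragment, not taken from the real page. It only illustrates how repr() exposes the whitespace that a pattern must account for.

```python
import re

# Hypothetical fragment with a Windows line ending and a tab (illustration only).
chunk = '<p>one</p>\r\n\t<p>two</p>\n'

print(chunk)          # the line breaks and the tab are rendered, not shown
print(repr(chunk))    # '<p>one</p>\r\n\t<p>two</p>\n'

# Once the exact separators are known, the pattern can name them.
m = re.search(r'</p>\r\n\t<p>', chunk)
print(m is not None)  # True
```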
The only inconvenience is that ' characters in the middle of the string are sometimes escaped, because that is how they must be written in a string literal delimited by ' quotes. When a list is displayed, its elements are displayed with the help of repr(); that's why the instruction print li[12:14]
displays the elements in the same form as the instruction print '\n'.join(map(repr,li[12:14]))
. In fact, repr() displays a string the way it would have to be written in source code to have that value.
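Both statements can be checked directly. In this sketch the meta line is copied from the output above; the variable name line and the use of ast.literal_eval() for reading the literal back are my own choices.

```python
import ast

line = '<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n'

# The display of a list is built from the repr() of its elements.
print(str([line]) == '[' + repr(line) + ']')   # True

# repr() gives a literal that, read back as source code, yields the same value.
print(ast.literal_eval(repr(line)) == line)    # True

# The value itself contains no backslash at all.
print(line.count('\\'))                        # 0
```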
In the end, what I want to underline is this:
if someone writes a regex pattern with "\\\\'"
or r"\\'"
because the display of a source code with repr() made them believe that there is a \ character before each ' character, the pattern is incorrect.
The following code shows this better, I hope:
import re
from urllib import urlopen
url = 'http://www.heronswood.com/perennials_bergenia/bergenia-lunar-glow/'
sock = urlopen(url)
srce = sock.read()
sock.close()
# Correct pattern: it looks for a plain ' character
pat = '<meta name="abstract" content="(Heronswood Bergenia (\'Lunar Glow\')? [a-zA-Z]+\d+ .*?)" />'
regx = re.compile(pat)
print regx.search(srce).groups()
# Incorrect pattern: \\\\' requires a real backslash in front of each '
pat = "<meta name=\"abstract\" content=\"(Heronswood Bergenia (\\\\'Lunar Glow\\\\')? [a-zA-Z]+\d+ .*?)\" />"
regx = re.compile(pat)
print regx.search(srce).groups()
Result:
("Heronswood Bergenia 'Lunar Glow' PP20247 in Bergenia", "'Lunar Glow'")
Traceback (most recent call last):
  File "I:\trez.py", line 18, in <module>
    print regx.search(srce).groups()
AttributeError: 'NoneType' object has no attribute 'groups'
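For anyone who wants to reproduce this without the network, here is a self-contained variant. It is a sketch under assumptions: the meta line is hard-coded from the output shown earlier, the patterns are shortened to start at content= and written as raw strings, and the names good and bad are mine. It uses print() calls so that it also runs on Python 3.

```python
import re

# The 13th line of the page, hard-coded so the example does not depend
# on the page still being online.
srce = '<meta name="abstract" content="Heronswood Bergenia \'Lunar Glow\' PP20247 in Bergenia" />\n'

# Correct: the pattern looks for a plain ' character.
good = re.compile(r'''content="(Heronswood Bergenia ('Lunar Glow')? [a-zA-Z]+\d+ .*?)" />''')

# Incorrect: the pattern demands a real backslash before each '.
bad = re.compile(r'''content="(Heronswood Bergenia (\\'Lunar Glow\\')? [a-zA-Z]+\d+ .*?)" />''')

m = good.search(srce)
print(m.groups() if m else None)   # both groups are captured
print(bad.search(srce))            # None: the source has no backslash
```

Testing the match object before calling groups() avoids the AttributeError seen in the traceback when the pattern does not match.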